EXECUTIVE SUMMARY


PART  I  REDUNDANCY  AND  FAULT  TOLERANT  COMPUTER  ARCHITECTURE 


Part I of this chapter deals with redundancy and its framework. The
framework of redundancy consists of (i) the modeling and evaluation of the
redundancy constructs, and (ii) the embodiment of the constructs in
fault-tolerant computer architecture.

Mathematical modeling of redundancy constructs permits their quantitative
evaluation and provides a numeric basis for critical comparison.

Case histories of fault-tolerant computer architecture illustrate, through
the designer's selection of particular redundancy constructs from the
available repertoire, the relative significance placed on specific
constructs in relation to their functional environment in the architecture.

In general, a system designed in such a manner that only the absolute
minimum amount of hardware is utilized to implement its function is said to
be non-redundant, or to have a simplex structure. If, even after utilizing
the finest components available, the desired system reliability is not
achieved, or if failure-tolerance is desired as a system capability, then
redundancy as a design procedure is resorted to, i.e., more system elements
are used than are absolutely necessary to realize all the system's
functions (excepting the attributes of reliability and fault-tolerance).
The additional system elements, referred to as the redundant elements, need
not all be hardware elements but may also be additional software (software
redundancy), additional time (time redundancy) and additional information
(information redundancy). An example of the latter is the application of
error-detection and correction codes.

Naturally, hardware, software, and time redundancy are often interrelated.
Additional software requires additional memory storage, and additional time
is used to execute the added software. The term protective redundancy is
often used to characterize redundancy that has an overall beneficial effect
on the system attributes, since redundancy without proper application may
well become a liability. Protective redundancy is utilized to realize
fault-tolerant digital systems and self-repairing systems by such means as
triple or N-tuple modular redundancy (TMR, NMR), quadded redundancy,
standby-replacement redundancy, hybrid redundancy, software redundancy and
the application of error-detection and correction codes.

Redundancy as a procedure for designing a more reliable system than is
allowed by the intrinsic reliability of the constituent components is as
old as the discipline of engineering itself. Examples of the use of
redundancy in ancient times are provided by civil engineering construction,
where more than the absolute minimum of material was used as insurance
against (i) the lack of accurate knowledge of underlying phenomena, and
(ii) the lack of confidence in the available data on the materials used.
Redundancy as a procedure is even more basic. This is evidenced by the
evolutionary processes of life, which make abundant use of it (e.g., in the
human body there are two kidneys, two lungs, two cerebral hemispheres,
etc.).

In the computer age, redundancy has been used at all levels of technology,
from VLSI devices, circuitry, logic, subsystems, and computers up to entire
networks of digital systems.

Part I of this chapter spans the general area of fault-tolerant systems.
The utilization of the various protective redundant structures as basic
building blocks for fault-tolerant digital computing systems has been
described and evaluated comparatively. A unifying notation for
characterizing the most commonly used protective redundancy schemes has
been presented. It has also been demonstrated that the k-out-of-N redundant
model subsumes, either directly or by composition, a great number of other
redundant structures.

By applying reliability analysis to these fault-tolerant systems, their
overall reliability can be measured and compared.

PART  II  SELF-CHECKING  CIRCUITS 

Self-checking circuits are, by definition, circuits whose outputs are
encoded in an error-detecting code. In Part II of this chapter the
underlying theory based on code spaces is developed to present the notions
of self-checking circuits, partially self-checking circuits, totally
self-checking circuits, and totally self-checking networks. An introduction
to Morphic Boolean logic, which is an aid to the design of self-checking
checkers, is also presented. These are presented with examples and
illustrations.

PART III CODING TECHNIQUES


Part III of this chapter deals with coding techniques used to achieve
concurrent diagnosis in digital computing systems. Coding theory is the
body of knowledge dealing with the science of redundantly encoding data so
that errors can be detected and, with further encoding, even corrected.

The fundamental principles underlying transmission codes as well as
arithmetic codes are developed and illustrated by the use of short, simple
examples. Both error detection and error correction properties are treated,
and the tradeoffs between them are explained.

The use of residue codes for protecting instruction words in the JPL-STAR
computer is given as a real example.

Coding theory is a very rich and by far the most developed branch of
fault-tolerant computing. The theoretical basis, the functional limits of
reliable communication for a given channel, and the mathematical tools and
classification schemes are well established. This section does not attempt
to be an exhaustive evaluation; the emphasis is on highlighting the
essential principles by means of short examples. For the more interested
practitioner, pointers to the literature are provided.

PART I


REDUNDANCY  AND  FAULT  TOLERANT  COMPUTER  ARCHITECTURE 


1.0  INTRODUCTION 

A fault-tolerant computer is a computer organized and structured such that
it can perform its design-specified functions even in the presence of
hardware failures.

By sheer force of necessity, reliability and fault-tolerance are important
attributes in computer architecture. Historically, the early days of
computers, because of the unreliability of thermionic devices, were
extremely innovative and productive of fault-tolerance techniques, many of
which we take for granted these days, e.g., parity checking, retrying of
operations, duplexing of processors, etc. Then, with the advent of
semiconductors and their greater inherent reliability, fault-tolerance was
no longer a pressing issue, and undiverted effort was allocated to the
enhancement of computational architectures. Subsequently, the space age and
computer-oriented national defense needs again shifted the equilibrium
between intrinsic component reliabilities on the one hand and the sheer
size, complexity, and hazardous application environments of the fabricated
structures on the other. These demands on reliability, continuous service,
and hardware integrity spawned a new breed of computer architecture
entitled "fault-tolerant computers."

This chapter surveys the various techniques employed in fault-tolerant
architectures from the point of view of the protective redundancy
structures utilized, and points out the significant features unique to the
implementation of fault-tolerance in either hardware, microprogramming, or
software.

The domain of reliability engineering involves considerations of all
aspects of design, development and fabrication, so as to minimize the
chance of equipment breakdown. Neglect of reliability considerations can
prove to be very costly, from the loss of consumer acceptance of a product
to the loss of missions, such as rocket launches of spacecraft, which
depend heavily on reliability engineering. Failure of a single component
could result in the total loss of the system.

Reliability in a qualitative sense can mean a host of different things
relating to confidence in the goodness of the equipment, and is closely
connected with, but often confused with, the concepts of maintainability,
availability, safety and even security of the system. Quantitatively,
reliability can be formulated mathematically as the probability that the
system will perform its intended function over the stated duration of time
in the specified environment for its usage.

As equipment becomes more complex, the chance of system failure becomes
greater, since the reliability of an equipment depends on the reliability
of its components. The relationship between parts reliability and system
reliability can be formulated mathematically to varying degrees of
precision depending on the scale of the modeling effort. The mathematics of
reliability is based on parts failure rate statistics and probability
theoretic relationships. The mathematical theory of reliability is used to
model, simulate and predict the equipment's proneness to failure under
expected operating conditions.

There have been two distinct and viable approaches taken to enhance system
reliability. One is based on component technology, i.e., manufacturing the
component to be as intrinsically reliable as possible, followed by parts
screening, quality control, pretesting to remove early failures (infant
mortality effects), etc. The second approach is based on the organization
of the system itself, e.g., fault-tolerant architectures in which the
architecture makes use of protective redundancy to mask or remove the
effects of failure, and thereby provides greater overall system reliability
than would be possible by the use of the same components in a simplex or
non-redundant configuration.

Fault-tolerance is the capability of the system to perform its functions to
its design specifications even in the presence of hardware failures. If, in
the event of faults, the system's functions may be performed but do not
meet the design specifications with respect to the time required to
complete the job or the storage capacity required for the job, then the
system is said to be partially or quasi fault-tolerant. Since the number of
possible hardware failures can be very large, in practice it is necessary
to restrict fault-tolerance to prespecified classes of faults from which
the system is designed to recover.

Faults may be classified as transient or permanent, deterministic or
indeterminate, local or catastrophic. The first category refers to the
duration of the fault, the second to its effect on the values of the system
design parameters, and the third to the propagation of the fault to its
neighboring elements.

Fault-tolerance is provided by the application of protective redundancy,
i.e., the use of more resources so as to upgrade system reliability. These
resources may consist of more hardware, more software, more time, or a
combination of all of these. Extra time is required to retransmit messages
or to reexecute programs, extra software is required to perform diagnosis
on the hardware, and extra hardware is required to provide replication of
units.

Hardware redundancy may be of the fault-masking type, the self-repair type,
or a hybrid of the two. In fault-masking, redundancy is of a static nature:
faults are masked instantly, and the operations of fault detection,
location and correction are indistinguishable. In self-repair, redundancy
is used dynamically: faults are selectively masked, and are detected,
located and subsequently corrected by the replacement of the failed unit by
an unfailed replica. Examples of the former are Triple Modular Redundancy
(TMR) and quadding, and of the latter standby-replacement (SR) systems and
reconfigurable systems. Schemes using combinations of these two basic
approaches are called hybrid or adaptive redundancy.

1.1  SOME  FUNDAMENTAL  PRINCIPLES 


The fundamental principle of reliability is that reliability is not solely
inherent to a component but is also a function of how the component is
used. Another fundamental principle of achieving reliability by means of
protective redundancy is that redundancy be applied at the smallest level
of complexity of the system in order to maximize the gain in reliability.
This is an idealized statement since, in practice, there are tradeoffs due
to the overhead required in utilizing redundancy techniques, e.g.,
providing voters in TMR systems and detection-switching requirements in
standby-sparing systems. The application of the mathematical theory of
reliability to model such systems provides quantitative design guidelines
for making such tradeoffs and optimizations in practice.

If the above are the first and second principles of fault-tolerance, then
the third principle states that a system may be made arbitrarily reliable
provided that the degree of redundancy is made high, i.e., a sufficiently
large number of replicas are provided. Again this principle holds only in
an idealized situation; in practice, the probability of detecting a failure
and correctly switching in a spare is less than unity, and this parameter,
called coverage, limits the advantages postulated by the third principle.

A fourth principle concerns the problem of requiring the checking elements
(those elements that are used for the diagnosis of the rest of the system
and the subsequent reconfiguration of the system units) also to be
checkable. This is the problem of "checking the checker." Thus, the fourth
principle is formulated to state that any system utilizing protective
redundancy will have major and minor "hardcores" (i.e., unprotected system
elements) and that these cannot be totally eliminated from the system
design; however, they may be made arbitrarily small by the judicious use of
a mixture of different protective redundancy techniques.

1.2  MATHEMATICAL  THEORY  OF  RELIABILITY 


Some relationships between reliability parameters and the underlying
probability theoretic relationships are as follows. If a fixed large number
$N_0$ of identical items is being tested, of which $N_s$ is the number of
items surviving after time t and $N_f$ the number of items which failed
during time t, then, for all t, $N_0 = N_s + N_f$. Now, for a sufficiently
large $N_0$, the reliability R(t) of an item is $N_s/N_0$. The failure rate
$\lambda(t)$, which is defined to be the rate at which the population
changes at time t, can be shown to be given by

    $$\lambda(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt} \tag{1}$$

so that

    $$R(t) = \exp\left(-\int_0^t \lambda(x)\,dx\right) \tag{2}$$

The reliability function R(t) is often called the survival probability
function since it measures the probability that failure of an item does not
occur during the time interval [0,t].
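
As a rough illustration of the estimate of R(t) as the surviving fraction of a test population, the following Python sketch simulates a population of items and counts the survivors at several times. The exponential lifetime distribution, the failure rate value, the population size, and the function name `empirical_reliability` are assumptions chosen for the example and do not come from the text.

```python
import math
import random


def empirical_reliability(lifetimes, t):
    """Estimate R(t) as N_s / N_0: the fraction of items still alive at time t."""
    n0 = len(lifetimes)                             # N_0: items placed on test
    ns = sum(1 for life in lifetimes if life > t)   # N_s: survivors at time t
    return ns / n0


# Assumed test population: 100,000 items with a constant failure rate.
rng = random.Random(0)     # fixed seed so the run is repeatable
failure_rate = 0.001       # assumed value, in failures per hour
lifetimes = [rng.expovariate(failure_rate) for _ in range(100_000)]

for t in (100.0, 500.0, 1000.0):
    estimate = empirical_reliability(lifetimes, t)
    exact = math.exp(-failure_rate * t)   # survival probability for a constant rate
    print(f"t={t:6.0f} h  N_s/N_0={estimate:.4f}  exp(-rate*t)={exact:.4f}")
```

With a large population the counted fraction should track the survival probability closely, which is the sense in which the text requires a sufficiently large number of items on test.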

1.2.1  Failure  Rate 

Statistical data on equipment failure yields a characteristic "bath-tub"
curve as shown in Figure 1. When the equipment is first put into service,
inherently weak components fail early; this stage is also called "infant
mortality." Subsequently the failure rate stabilizes quickly to a
relatively constant value; this period is called the useful life period.
After much usage the failure rate begins to increase rapidly due to
deterioration and wearout.


[Figure 1: plot of failure rate versus time showing three regions: early
failure period, constant failure rate (useful life), and wearout failure
period.]

Figure 1. Bath-tub curve of failure rate.


1.2.2  Exponential  Failure  Law 

In general, the failure law of a component is the probability distribution
obeyed from the moment at which the component enters service up to the
moment of its failure. In practice the most commonly used failure law is
the exponential law, which applies when a component is subject only to
failures which occur at random intervals and the average number of failures
is the same for equal time periods. These constraints are valid for a
component which is no longer subject to infant mortality failures and whose
failure rate is constant within the "useful-life" span. Thus, for operating
periods within the useful life, the component reliability over a period of
time t can be expressed as $R(t) = e^{-\lambda t}$, where $\lambda$
(usually expressed in failures per hour or per million hours) is the
constant failure rate of the device. A characteristic of the exponential
failure law is that, within the useful life period, the reliability of the
device is the same for operating times of equal duration.

From the definition of R(t) it follows that the mean time between failures
(MTBF), or the mean time to first failure (MTTF), usually expressed in
hours, is given by $\int_0^\infty R(t)\,dt$, i.e., it is the area
underneath the reliability curve R(t) plotted versus t. This result is true
for any failure distribution. For the specific case of the exponential
failure law the MTBF, m, is equal to $1/\lambda$. Further, when the product
$\lambda t$ is small, the equation for R(t) may be approximated by
$R(t) \approx 1 - \lambda t$. Thus, if $\lambda t = 0.01$,
$R(t) = e^{-0.01} \approx 0.99$, or 99.0 percent. The product $\lambda t$
is often referred to as the "normalized" time, since $\lambda t = t/m$,
i.e., the mission time t normalized with respect to the MTBF.
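
The exponential law and its small-argument approximation can be checked numerically. The Python sketch below is only an illustration; the failure rate and mission time are assumed values picked so that the normalized time equals 0.01, and the function names are hypothetical.

```python
import math


def reliability(failure_rate, t):
    """Exponential failure law: R(t) = exp(-failure_rate * t)."""
    return math.exp(-failure_rate * t)


def mtbf(failure_rate):
    """MTBF for a constant failure rate: m = 1 / failure_rate."""
    return 1.0 / failure_rate


def linear_approximation(failure_rate, t):
    """First-order approximation R(t) ~ 1 - failure_rate * t (small products only)."""
    return 1.0 - failure_rate * t


failure_rate = 1e-4   # assumed: failures per hour
mission_time = 100.0  # assumed: hours, so failure_rate * mission_time = 0.01

exact = reliability(failure_rate, mission_time)
approx = linear_approximation(failure_rate, mission_time)
normalized_time = mission_time / mtbf(failure_rate)

print(f"MTBF            = {mtbf(failure_rate):.0f} h")
print(f"normalized time = {normalized_time:.4f}")
print(f"exact R(t)      = {exact:.6f}")
print(f"1 - rate*t      = {approx:.6f}")
```

The two reliability values agree to about four decimal places at this normalized time; the approximation degrades as the product of failure rate and mission time grows.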

2.0  PRINCIPAL  REDUNDANCY  STRUCTURES  AND  THEIR  MODELS 
2.1 SERIES RELIABILITY

If a system is composed of elements in such a way that the failure of any
one element causes a failure of the system, then these elements are
considered to be functionally in series. For the system to survive each
element must survive. The probability of survival for the system cannot be
better than that of the element with the lowest probability of survival;
e.g., a chain is no better than its weakest link. When these series
elements are independent of each other then, by the probability
multiplication law, the system survival probability is the product of the
individual survival probabilities of the elements, i.e.,

    $$R_{system} = \prod_{i=1}^{n} R_i$$

where $R_i$ is the reliability of the i-th element of an n-element system.
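
A minimal Python sketch of the series product rule follows. The element reliabilities are made-up values, and `series_reliability` is a hypothetical helper name.

```python
import math


def series_reliability(element_reliabilities):
    """Product of independent element reliabilities (series system)."""
    return math.prod(element_reliabilities)


# Assumed example: three independent elements in series.
elements = [0.99, 0.95, 0.90]
r_system = series_reliability(elements)

print(f"system reliability = {r_system:.5f}")
print(f"weakest element    = {min(elements):.5f}")
```

The system value never exceeds the weakest element's reliability, which is the "weakest link" observation in numeric form.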


2.2  PARALLEL  RELIABILITY 


Parallel reliability is an illustration of protective redundancy. The
system is composed of functionally parallel elements in such a way that if
one of the elements fails the parallel unit will continue to perform the
system function.

The system reliability, under the assumption of independence of failure of
the n elements, each of reliability R, is expressed by

    $$R_{system} = 1 - (1-R)^n$$

which is the probability that not all the n elements have failed. The term
(1-R), known as the unreliability of a unit, is the probability that a unit
will fail.
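
The parallel expression can be tabulated to see how quickly the unreliability shrinks with added elements. This Python sketch assumes identical, independent elements; the unit reliability of 0.9 is an arbitrary illustrative value.

```python
def parallel_reliability(r, n):
    """Probability that not all n independent elements of reliability r have failed."""
    unreliability = 1.0 - r          # probability that one unit fails
    return 1.0 - unreliability ** n  # at least one unit survives


r = 0.9  # assumed unit reliability
for n in range(1, 5):
    print(f"n={n}: R_system = {parallel_reliability(r, n):.4f}")
```

With n = 1 the expression reduces to the simplex reliability, and each added element multiplies the system unreliability by (1-R).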


2.3  TRIPLE  MODULAR  REDUNDANCY  (TMR) 

A TMR system is also known as the multiple-line voting system (see
Figure 2). One of the earliest and most influential schemes was developed
by J. von Neumann [1]. The simplex unit is triplicated and each of the
three independent units feeds into a majority voter which outputs the
majority signal. The system fails if more than one unit fails, in which
case the failed units outvote the good one. This scheme is generalized to
N-modular redundancy (NMR), where N is any odd number of units. Various
schemes for protecting the voter are available, and various other variants
of the basic TMR strategy have been developed. The TMR system reliability
is expressed as

    $$R_{system} = [R^3 + 3R^2(1-R)] R_v$$

which is the product of $R_v$, the voter reliability, and the reliability
of the idealized TMR system. The idealized TMR system reliability is the
sum of the probabilities of the two events that (i) all three units
survive, $R^3$, and (ii) exactly two units survive and one unit fails,
$3R^2(1-R)$.
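
The TMR expression can be evaluated to see where triplication helps. The Python sketch below is illustrative only: the sample unit reliabilities and the voter reliability of 0.99 are assumed values, and the default of a perfect voter is an idealization.

```python
def tmr_reliability(r, r_voter=1.0):
    """Idealized TMR reliability times the voter reliability.

    r       -- reliability of each of the three identical units
    r_voter -- voter reliability (1.0 assumes a perfect voter)
    """
    all_three_survive = r ** 3
    exactly_two_survive = 3 * r ** 2 * (1 - r)
    return (all_three_survive + exactly_two_survive) * r_voter


for r in (0.4, 0.5, 0.9):
    print(f"R={r:.1f}: simplex={r:.3f}  "
          f"TMR(perfect voter)={tmr_reliability(r):.3f}  "
          f"TMR(voter 0.99)={tmr_reliability(r, 0.99):.3f}")
```

With a perfect voter, TMR equals the simplex reliability at R = 0.5, improves on it above that value, and is worse below it, so triplicating poor units does not help.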

Figure 2. TMR system


2.4 QUADD REDUNDANCY

Quadding is an illustration of component redundancy and is similar in
concept to TMR. The major difference is that the voting, restoration, or
fault-masking functions are distributed into the network and are not
separable as in TMR. An example of quadding is shown in Figure 3, where the
non-redundant logic circuit in Figure 3a is shown "quadded" in Figure 3b.
The process by which an error introduced at one stage is subsequently
corrected at a later stage is illustrated. In general the quadding
procedure requires that each logic gate be quadruplicated and that each of
the gates in a quadded stage have twice as many inputs as the non-redundant
gate replaced. The outputs of a stage are interconnected to the inputs of
the succeeding stage by an interconnection pattern such that the effects of
errors in earlier stages get subsequently "restored" in the later stages,
i.e., the originally "good" signal is restored.


2.5  STANDBY  REPLACEMENT  REDUNDANCY 

For standby replacement redundancy, unlike TMR, only one unit is
operational at a time (see Figure 4). When the active unit fails, this
event is detected by additional circuitry and a spare unit from a reserve
of S spares is switched in to replace the failed unit, thereby restoring
the system to its operational state. The reliability of this system is
expressed as

    $$R_{system} = 1 - (1-R)^{S+1}$$

which is the probability that not all S+1 units have failed.
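
One practical use of the standby expression is sizing the reserve of spares. The Python sketch below assumes perfect failure detection and switching, as the expression itself does; the function names, the unit reliability, and the target value are hypothetical.

```python
def standby_reliability(r, spares):
    """Probability that not all (spares + 1) units of reliability r have failed."""
    return 1.0 - (1.0 - r) ** (spares + 1)


def spares_needed(r, target, max_spares=100):
    """Smallest number of spares meeting the target, or None if not reached."""
    for s in range(max_spares + 1):
        if standby_reliability(r, s) >= target:
            return s
    return None


r = 0.9         # assumed unit reliability
target = 0.995  # assumed required system reliability
print(f"spares needed: {spares_needed(r, target)}")
for s in range(4):
    print(f"S={s}: R_system = {standby_reliability(r, s):.4f}")
```

Because detection and switching are taken as perfect here, the result is an upper bound on what a real standby system would deliver.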

2.6 HYBRID REDUNDANCY

Hybrid redundancy is a synthesis of TMR and standby replacement redundancy
(see Figure 5). It consists of a TMR system (or in general an NMR system)
with a bank of spares such that when one of the TMR units fails, the failed
unit is replaced by a spare unit. Failure detection is achieved by means of
the disagreement detector, which compares the individual outputs of each of
the TMR'd units with the system output. Upon a disagreement the
disagreement detector issues a signal to the switching network to replace
the failed unit by a spare unit. At such time as all spares are utilized
the hybrid redundancy system reduces to a TMR system. Variations of the
hybrid or adaptive redundancy schemes are available. The system reliability
in its simplest terms may be expressed as

    $$R_{system} = 1 - [(1-R)^{S+3} + (S+3)(1-R)^{S+2} R]$$

which is the probability that neither all S+3 units have failed nor exactly
S+2 units have failed with only one surviving.
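
The hybrid expression can be evaluated directly. The Python sketch below treats the voter, disagreement detector, and switching network as perfect, which is an idealization; the function name and the sample values are assumptions.

```python
def hybrid_reliability(r, spares):
    """Idealized reliability of a TMR core backed by `spares` standby units."""
    n = spares + 3                                  # total number of units
    all_fail = (1 - r) ** n                         # every unit has failed
    only_one_survives = n * r * (1 - r) ** (n - 1)  # exactly one unit survives
    return 1 - (all_fail + only_one_survives)


r = 0.9  # assumed unit reliability
for s in range(4):
    print(f"S={s}: R_system = {hybrid_reliability(r, s):.6f}")
```

With no spares the value matches the idealized TMR reliability, and each added spare raises it further under these idealized assumptions.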

A comparison of the reliability improvement and mean-life improvement of
systems using no redundancy (simplex systems), TMR, standby sparing, and
hybrid redundancy is presented by Mathur [27].

Figure 5. Hybrid redundancy


K-OUT-OF-N  REDUNDANT  ARCHITECTURE 


This section gives a unified treatment of the various protective redundancy
structures described in the preceding sections. It is contended that by and
large all fault-tolerant computers are particular cases of the class of
partitioned K-out-of-N redundant structures. The main differences are in
(i) the degree of partitioning used and (ii) the means of error-detection
employed.

The basic underlying structure of all Hardware Implemented Fault-Tolerant
(HIFT) systems is the so-called K-out-of-N structure. It is composed of a
total of N identical units. For the structure to function, at least K of
the N replicated units must remain operational; hence its name. This
structure can tolerate up to (N-K) independent failures, one in each of
(N-K) out of the total N units. Thus, these structures exhibit a
fault-tolerance, t, equal to N-K. Here fault-tolerance means the total
number of replicas that the system can afford to have failed while itself
remaining operational.

If r is the reliability (survival probability) of an individual replicated
unit, then the reliability of the K-out-of-N structure, under the
assumption that failures are independent events, is given by the expression

    $$R(K\text{-out-of-}N) = \sum_{i=K}^{N} \binom{N}{i} r^i (1-r)^{N-i}$$

where $\binom{N}{i} = N!/((N-i)!\,i!)$.

This reliability expression is simply the summation of the probabilities of
all the successful events, i.e., the system survives provided K, K+1, K+2,
..., N-1 or N units survive. The probability of a particular i units
surviving is $r^i$. The probability of the remaining (N-i) units having
failed is $(1-r)^{N-i}$, and the number of ways in which this event can
occur is N-combinatorial-i. The summation over all these events from i = K
to N yields the above general expression. This powerful expression has a
number of special cases which represent many of the commonly used
protectively redundant structures. These special cases will now be
described.
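
Before turning to the special cases, the general expression can be sketched in Python and cross-checked by enumerating every survive/fail pattern of the N units. The function names are hypothetical and the numeric values are arbitrary; the brute-force check is included only to mirror the "summation of successful events" argument.

```python
from itertools import product
from math import comb


def k_out_of_n(k, n, r):
    """Reliability of a K-out-of-N structure of independent units of reliability r."""
    return sum(comb(n, i) * r ** i * (1 - r) ** (n - i) for i in range(k, n + 1))


def k_out_of_n_brute_force(k, n, r):
    """Same quantity, summed over all 2**n survive/fail patterns of the units."""
    total = 0.0
    for pattern in product((True, False), repeat=n):
        survivors = sum(pattern)           # number of surviving units in this pattern
        if survivors >= k:                 # a successful event
            total += r ** survivors * (1 - r) ** (n - survivors)
    return total


r = 0.8  # assumed unit reliability
print(f"formula     : {k_out_of_n(2, 5, r):.6f}")
print(f"enumeration : {k_out_of_n_brute_force(2, 5, r):.6f}")
```

The binomial coefficient in the formula simply counts how many of the enumerated patterns have exactly i survivors.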

Case where K = N:  Here all units need to survive for the structure
                   to survive. This is the case when all units are
                   in series reliability, and is representative of
                   simplex (i.e., non-redundant) designs. Here the
                   system reliability in terms of the unit
                   reliability r is:

                       $$R(N\text{-out-of-}N) = r^N$$

                   and this structure exhibits zero fault-tolerance.

Case where K = 1:  Here only one unit of the total N needs to survive
                   for the structure to survive. This is typical of
                   standby-spare redundancy, where one unit is active
                   at any given time and the remaining are dormant as
                   standbys.

                       $$R(1\text{-out-of-}N) = 1 - (1-r)^N$$

                   The above reliability expression for standby-spares
                   states that for the structure to be functional not
                   all of the N units should have failed. Thus the
                   case K = 1 represents a structure in parallel
                   reliability, and exhibits a fault-tolerance of
                   t = N-1.

Case where K = 2:  The above two cases, K = 1 and K = N, are the lower
                   and upper bounds on K in K-out-of-N structures. We
                   now consider intermediate values of K. If K = 2,
                   then the structure survives provided that at least
                   two out of the N units are operative. This is the
                   condition for a Hybrid redundant system having 3
                   units in triple modular redundancy (TMR) and the
                   remaining units as standby spares.


In the Hybrid redundant architecture (hybrid because it combines TMR and
standby-spare redundant structures) three of the total N units are operated
in TMR and the remaining N-3 units serve as backup units. Whenever one of
the units composing the TMR structure fails, it is replaced by one of the
backup units. This process continues until all backup units are exhausted,
at which time the hybrid structure reduces to a TMR structure. The TMR
structure remains operative as long as at least 2 out of the three units
remain functional. Since only two units are required to remain operational
throughout the life of the system, the 2-out-of-N structure is equivalent
to the hybrid redundant architecture. The system reliability is given by

    $$R(2\text{-out-of-}N) = 1 - (1-r)^{N-1}[1 + r(N-1)]$$

and the fault-tolerance of a hybrid redundant system is equal to N-2.
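
The closed form for K = 2 should agree with the general K-out-of-N summation. The Python sketch below checks this numerically for a few values of N; the function names and the unit reliability are assumptions made for illustration.

```python
from math import comb


def k_out_of_n(k, n, r):
    """General K-out-of-N reliability for independent units of reliability r."""
    return sum(comb(n, i) * r ** i * (1 - r) ** (n - i) for i in range(k, n + 1))


def two_out_of_n(n, r):
    """Closed form for K = 2: 1 - (1-r)**(N-1) * [1 + r*(N-1)]."""
    return 1 - (1 - r) ** (n - 1) * (1 + r * (n - 1))


r = 0.7  # assumed unit reliability
for n in range(2, 7):
    print(f"N={n}: closed form={two_out_of_n(n, r):.6f}  "
          f"summation={k_out_of_n(2, n, r):.6f}")
```

The closed form follows from subtracting the two unsuccessful events (no survivor, exactly one survivor) from unity and factoring out $(1-r)^{N-1}$.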

Case where K = 2 and N = K+1:  It is well known that a hybrid redundant
                   system H(3,S) with no spares is equivalent to a TMR
                   configuration. Thus, in the previous case, if the
                   total number of units N is three and K is still 2,
                   we then have the classical von Neumann TMR system,
                   i.e., for the system to remain operational at least
                   2 out of the three total units must remain
                   operational. The reliability equation of the TMR
                   system is given by

                       $$R(2\text{-out-of-}3) = r^3 + 3r^2(1-r)$$

                   and the system has a fault-tolerance, t, of one.

Case where K = (N+1)/2:  The TMR structure can be generalized to an
                   N-modular redundant (NMR) structure, where N is any
                   odd number of units operating in a majority
                   configuration, i.e., an (N+1)/2-out-of-N
                   configuration. The reliability expression for the
                   NMR system is

                       $$R((N+1)/2\text{-out-of-}N) = \sum_{i=0}^{(N-1)/2} \binom{N}{i} (1-r)^i r^{N-i}$$

                   This system is capable of tolerating (N-1)/2
                   failures.
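
The NMR expression sums over the number of failed units rather than the number of survivors. The Python sketch below evaluates it for a few odd N; the function name and the sample reliabilities are assumptions, and the voter is taken as perfect.

```python
from math import comb


def nmr_reliability(n, r):
    """Idealized NMR reliability: at most (n-1)/2 of the n units may fail."""
    if n % 2 == 0:
        raise ValueError("NMR requires an odd number of units")
    max_failures = (n - 1) // 2
    return sum(comb(n, i) * (1 - r) ** i * r ** (n - i)
               for i in range(max_failures + 1))


for r in (0.4, 0.9):          # assumed unit reliabilities
    for n in (1, 3, 5, 7):
        print(f"r={r:.1f} N={n}: R_system = {nmr_reliability(n, r):.5f}")
```

For unit reliabilities above 0.5 the system reliability rises with N, while below 0.5 adding units makes the majority more likely to be wrong.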


Case where K = (n+1)/2 with n odd:  This case corresponds to the
                   generalized hybrid redundant architecture having a
                   general nMR core and (N-n) spares. If the number of
                   spares is zero then this case reduces to the
                   previous one for NMR. The general hybrid redundant
                   architecture can tolerate (N-n)+(n-1)/2 failures.


Composition  of  K-out-of-N  Structures 

The fact that hybrid redundancy is a combination of TMR and
standby-sparing is readily seen from the composition of the following
two K-out-of-N structures:

(i)   1-out-of-N: standby-sparing

(ii)  2-out-of-3: TMR

(i) and (ii) are composed to yield:

(2-out-of-3)-out-of-(N+3): Hybrid redundancy

The composition of (i) and (ii) is the Hybrid(3,N) system using a total of
N+3 units.

Similarly, for the generalized hybrid redundancy case, NMR redundancy
and standby-sparing can be composed thus:

(iii) 1-out-of-S: standby-sparing

(iv)  (N+1)/2-out-of-N: NMR

The composition of (iii) and (iv) yields:

((N+1)/2-out-of-N)-out-of-(S+N): General hybrid

This composition represents the general hybrid redundant structure
H(N,S) having a total of N+S units.

Similarly, other redundancy schemes can be shown to have a K-out-of-N
structure. The intent has not been to list all equivalences
exhaustively; the reader may readily try to represent some of the other
redundant structures as K-out-of-N. The structures described here are
summarized in Table I.


STRUCTURE      K                        FAULT-TOLERANCE, t

Series         K = N                    0
Parallel       K = 1                    N - 1
TMR            K = 2; N = K+1           1
NMR            K = (N+1)/2; N odd       (N-1)/2
Hybrid(3,S)    K = 2                    N - 2
Hybrid(n,S)    K = (n+1)/2; n odd       S + (n-1)/2

Table I: Summary of K-out-of-N Structures


It should be noted that in the above discussion no mention was made
of the internal mechanisms by which errors are detected (errors imply
failures in the system) or of the means by which the system is
reconfigured to remove the effect of these failures. These will be
described in the next sections with reference to actual systems.
However, in general, it may be stated that all HIFT structures are
particular cases of (i) K-out-of-N structures, (ii) self- or
mutual-composition of these, (iii) partitioned K-out-of-N structures,
(iv) compound (series) combinations of different K-out-of-N structures,
and other combinations and permutations of these.

3.0 PARTITIONED AND BALANCED FAULT-TOLERANCE


As stated earlier, one of the fundamental principles of fault-tolerance
is that redundancy ought to be applied to the smallest level of
complexity of the system, in order to maximize the gain in reliability.
Naturally, the overhead costs and associated unreliabilities involved in
implementing too fine a level of partitioning dictate a compromise.
Another factor in determining the partitioning resolution, or the
modularization level, is the occurrence of natural interfaces in
computer systems. Segmentation of a simplex computer design cannot be
carried out arbitrarily but has to occur at the natural boundaries
between the various functional subsets in order to (i) simplify the
intercommunication between the modules, and (ii) provide the necessary
degree of isolation from failure propagation between one module and
another.

Another effect of partitioning along natural boundary lines is that
no two of the resulting functional modules will be identical (identical
in the reliability sense of having identical effective failure rates).
The only exception is memory modules, which are readily packageable
into 4K or other standard-size modules. Another possible exception is
where the simplex computer has a highly uniform structured organization,
e.g., that of parallel processor array computers.

The net effect of unequal modules, from a fault-tolerance viewpoint,
is the task of balancing fault-tolerance over the entire system. Since a
chain is no stronger than its weakest link, the architect has to strive
to have all the subsystems that comprise the system in fault-tolerance
balance. In order to balance unequally weighted subsystems, the degree
of redundancy applied will vary from subsystem to subsystem. The two
notions of level and degree of redundancy can now be formalized.

Level of Redundancy: The level of a system to which redundancy is
applied refers to the size or complexity of the unit to be replicated.
The finer the system partitioning, the lower the level of redundancy.

Degree of Redundancy: At any level of a system, the degree of
redundancy refers to the number of replicas provided at that level.

Again, in striving for a balanced fault-tolerant architecture where
all the subsystems are in fault-tolerance equilibrium, the designer has
to compromise. Perfect equilibrium cannot be reached but only
approximated, since each subsystem, having an arbitrary relationship to
the other subsystems, cannot be "fine-tuned" by simply adjusting the
number of its discrete replicas, i.e., the reliability performance
factors as a function of hardware are not continuous functions but
discrete functions.
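
The discrete nature of this balancing can be illustrated with a small sketch. It is not a method taken from the report: it assumes each subsystem is protected by simple 1-out-of-d replication with perfect switching, that subsystems are in series, and it uses a hypothetical greedy rule (always add a replica to the currently weakest subsystem).

```python
from math import prod


def subsystem_reliability(unit_r: float, degree: int) -> float:
    """Reliability of `degree` replicas of one unit, assuming 1-out-of-degree operation."""
    return 1.0 - (1.0 - unit_r) ** degree


def balance_degrees(unit_rs: list[float], target: float, max_units: int = 64) -> list[int]:
    """Greedily raise the degree of redundancy of the weakest subsystem until the
    series system meets `target`. Returns the degree chosen for each subsystem."""
    degrees = [1] * len(unit_rs)

    def system_reliability() -> float:
        return prod(subsystem_reliability(r, d) for r, d in zip(unit_rs, degrees))

    while system_reliability() < target:
        if sum(degrees) >= max_units:
            raise ValueError("target not reachable within the unit budget")
        weakest = min(
            range(len(unit_rs)),
            key=lambda i: subsystem_reliability(unit_rs[i], degrees[i]),
        )
        degrees[weakest] += 1
    return degrees


if __name__ == "__main__":
    print(balance_degrees([0.99, 0.90, 0.80], target=0.99))
```

Because replicas can only be added in whole units, the system reliability overshoots the target rather than meeting it exactly, and the subsystems end up only approximately balanced.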


4.0  CASE  HISTORIES 

4.1 QUADDING AND THE OAO-PPDS

One fault-tolerant structure, applied at the component and logic gate
level rather than at a module or subsystem level, which does not readily
fit into the class of K-out-of-N structures is the fault-tolerance
process of quadding. Quadding has been extensively and quite
successfully applied in the design of the on-board primary processor and
data storage (PPDS) unit of NASA's Orbiting Astronomical Observatory
(OAO) satellite.

The PPDS employs extensive component-level quad redundancy. Its data
storage buffer and data processor employ TMR, and its main memory is
duplex redundant, similar to the Saturn V LVDC (see the later section on
the Saturn V LVDC). The PPDS receives commands to control the
satellite's orientation, experiments, and data collection.

Quadding has three major advantages. First, it is applicable at the
component level. Secondly, it is an autonomously redundant structure.
By autonomous redundancy is meant the fact that no additional logic or
circuits are necessary to implement error detection, location and
reconfiguration. Thirdly, as a consequence of the second advantage, it
is applicable to truly real-time systems with continuous availability.
In contrast, consider the self-repairing standby-spares system, which
requires an effective internal downtime to detect errors, roll back the
program, retry, locate the source of error, and subsequently
reconfigure.

Some of the disadvantages of the quadding approach are:

1. power consumption

2. fan-out

3. wide tolerances

4. difficult to test

5. difficult to evaluate its reliability

6. expensive

7. structure inflexible and unmodifiable.

Despite these disadvantages, quadding has been successfully applied and
will continue to be applied in selective applications. It may be
mentioned here that the Saturn V LVDC utilizes quadding to protect the
decoupling capacitors in the power distribution system. Another
application of quadding is to protect the component-level hard-cores of
self-reconfigurable computers.

4.2  TMR  AND  THE  SATURN  V  LVDC 

Basically, triple modular redundancy (TMR) consists of triplicating
the simplex unit and deriving the system output by taking the majority
signal (by means of a vote taker) of the three independent signals from
the replicated units. The system can be partitioned, and the vote takers
themselves can also be triplicated. A major application of TMR
techniques is exemplified by the design of the Saturn V launch vehicle
guidance system. This guidance system is composed of two parts: (i) the
general purpose computer called the launch vehicle digital computer
(LVDC), and (ii) the launch vehicle data adapter (LVDA). The LVDA is an
input/output interface unit that buffers the computer to its launch
vehicle environment.
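
The vote taker described above can be modeled, at the logic level, as a bitwise 2-out-of-3 majority function. The sketch below is illustrative only; it treats each replica output as an integer word and is not a description of the LVDC voter circuit.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three equal-width words.

    Each output bit is 1 exactly when at least two of the three input bits are 1.
    """
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    good = 0b1010
    faulty = 0b0101  # one replica delivers a completely wrong word
    print(bin(majority(good, good, faulty)))
```

A single faulty replica is masked regardless of which bits it corrupts. Errors in two replicas are masked only if they fall in different bit positions.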

The computer characteristics of the LVDC are given in Table II.

Type:      General purpose, serial, fixed point, binary

Clock:     512 kilobits per second

Speed:     Add - 82 microseconds, 26 bit
           Multiply - 328 microseconds, 24 bit
           Divide - 566 microseconds, 24 bit

Memory:    32K, 28 bits

Weight:    44 kg

Volume:    0.62 m^3

Power:     152 watts

Table II. Saturn V LVDC Characteristics

The reliability goal for the LVDC was established at 0.99 for a
250-hour mission. It was felt that a computationally equivalent simplex
computer using conventional architecture would only be able to achieve a
reliability performance of 0.63 for the 250 hours. This increase in
target reliability from 0.63 to 0.99 was achieved by utilizing a
combination of redundancy structures.

The computer central logic is TMR and is divided into seven modules,
each with an average of thirteen voted outputs. A total of some 155
signals are voted on, by a total of 395 voters. The LVDA employs 237
voters in its TMR logic. The reason the LVDC uses only 395 voters and
not 465 voters (= 155 x 3) is that many of the central logic outputs are
supplied to duplex circuits in the memory and LVDA. Hence, only two
voters are needed at these outputs.
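
The benefit of dividing TMR logic into modules can be illustrated numerically. The sketch below is not a model of the LVDC: it assumes perfect voters, independent failures, and a simplex machine split into modules of equal reliability, none of which is claimed by the report. Only the 0.63 simplex figure is taken from the text.

```python
def tmr(r: float) -> float:
    """Reliability of a 2-out-of-3 structure of units with reliability r (perfect voter)."""
    return r**3 + 3 * r**2 * (1 - r)


def partitioned_tmr(simplex_r: float, modules: int) -> float:
    """Reliability when the simplex machine is cut into `modules` equal parts,
    each part triplicated and voted separately, with the parts in series."""
    module_r = simplex_r ** (1.0 / modules)  # assumed: equal-reliability modules
    return tmr(module_r) ** modules


if __name__ == "__main__":
    simplex = 0.63  # simplex estimate quoted in the text for 250 hours
    print(f"simplex            : {simplex:.4f}")
    print(f"TMR, one partition : {partitioned_tmr(simplex, 1):.4f}")
    print(f"TMR, seven modules : {partitioned_tmr(simplex, 7):.4f}")
```

Under these assumptions, unpartitioned TMR raises 0.63 only to roughly 0.69, whereas seven-way partitioning gives roughly 0.92. The figures are illustrative and are not LVDC design values.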

Instructions are composed of a four-bit operation code and a nine-bit
operand address. The nine-bit address allows 512 locations to be
directly addressed. The instruction address is augmented by a pair of
sector registers and a pair of module registers. Separate registers keep
track of data and instructions. Instructions are stored two to a word,
one in syllable 0 and the other in syllable 1 of a memory word.

The memory of the LVDC is protected by means of duplex redundancy.
The eight identical 4K memory modules may also be operated in simplex if
additional storage capability is desired. Two methods of error detection
are used. First, parity checking is performed by logic which is
protected by TMR. Also, circuitry is provided in each memory module to
detect (i) the absence or improper timing of X or Y half-select
currents, and (ii) the presence of select currents in more than one X or
one Y line. Any of these error detections will initiate memory
switching.

The memory operation strategy is as follows: when memory units are
operated in duplex, only one of the two buffer register outputs (A or B)
is used. If an error is detected in the memory currently being used, the
memory select logic switches over to the alternate unit. The incorrect
memory is then regenerated with the output of the replicated memory.
Thus, a transient error will be corrected and both memories will be
restored to proper operation. Switching from one memory to the other is
virtually instantaneous and causes no interruption. The only type of
failure that can cause complete memory system failure, called a
systematic failure, is the simultaneous failure of both memories at the
same storage location.
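
The duplex read strategy can be sketched in a few lines. This is a simplified model under stated assumptions, not the LVDC logic: error detection is reduced to a single parity bit per word (the half-select current checks are omitted), and the class and method names are hypothetical.

```python
class MemoryFailure(Exception):
    """Both copies of a word are bad: the 'systematic failure' case."""


def parity(word: int) -> int:
    """Parity bit of a non-negative integer (1 if the number of 1 bits is odd)."""
    return bin(word).count("1") & 1


class DuplexMemory:
    def __init__(self, size: int) -> None:
        empty = (0, parity(0))
        # Two banks; each cell holds (word, stored parity bit).
        self.banks = [[empty] * size, [empty] * size]
        self.active = 0  # bank whose buffer register output is currently used

    def write(self, addr: int, word: int) -> None:
        for bank in self.banks:
            bank[addr] = (word, parity(word))

    def read(self, addr: int) -> int:
        word, stored = self.banks[self.active][addr]
        if parity(word) == stored:
            return word
        alternate = 1 - self.active
        word, stored = self.banks[alternate][addr]
        if parity(word) != stored:
            raise MemoryFailure(f"both memories bad at address {addr}")
        self.banks[self.active][addr] = (word, stored)  # regenerate the faulty copy
        self.active = alternate  # switch over to the alternate unit
        return word
```

A read that finds a parity error in the active bank returns the alternate copy, rewrites the faulty location, and leaves the alternate bank selected. Only a fault at the same address in both banks raises `MemoryFailure`.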

The only subsystems not redundantly protected in the LVDC/LVDA are (i)
the clock oscillator and (ii) the telemetry logic. The reasoning used to
justify this is that the oscillator consists of only 5 component parts,
which is less than 1% of the total components used; hence the
probability of a failure occurring in this area is negligible. However,
in practice the designer is advised to take into account not only the
laws of statistics but also the more perverse laws of Murphy. No
explicit reliability model of this computer was developed; however,
extensive Monte Carlo failure simulation analysis was performed. The
Monte Carlo estimate for the reliability of the computer logic for a
250-hour mission, calculated from 20,000 simulated missions, is 0.9994.

4.3  STANDBY-SPARING  AND  THE  JPL-STAR  COMPUTER 

Standby-sparing structures (1-out-of-N) are exemplified by the Jet
Propulsion Laboratory's self-test and repair computer (STAR). The STAR
computer, like its architectural predecessor the Raytheon RAYDAC, is a
good illustration of an architecture that uses non-autonomous redundancy
as its principal means of failure protection. This is in sharp contrast
to the Saturn V LVDC which, as described earlier, used TMR
predominantly, duplex redundancy for memory and power supplies, and
quadding in an isolated instance (note that no standby-sparing was
employed anywhere).




The principal difficulties in implementing standby-sparing are: (i)
means for detecting errors, (ii) means for switching replicas, (iii)
conditioning requirements of spares before switching them on-line
(recovery strategy), (iv) isolation of the replicas from the
instruction/data buses and also from the power bus, and (v) problems of
checking the error detector (i.e., how to check the checker).

In the STAR, error detection is implemented by encoding all machine
words with codes that are preserved under arithmetic operations as well
as transmission. In conjunction with information encoding, decoders to
check the validity of information are provided, and the decoders
themselves are protected by a separate autonomous redundancy structure.
Replicas are switched by means of power switching rather than
information switching. The recovery strategy is implemented by means of
software interrupts and program rollback followed by retry. Isolation of
the functional units is by means of component redundancy.

The  principal  architectural  features  of  the  STAR  are: 

1. All data words and the address portions of instruction words are
encoded for error detection using modulo 15 residue coding. This permits
error detection concurrent with program execution. A 4-bit check byte
c(b) is appended to the 7-byte (28-bit) non-redundant binary number b,
where c(b) is computed to be the byte-wise complement of the modulo 15
residue of b.

2. The computer is subdivided into a number of replaceable functional
units, each containing its own operation code decoders and sequence
generators. This decentralization and replication of system control
provides simple fault location procedures and also simplifies
interfacing between units.

3.  Information  lines  of  the  replicas  are  permanently  connected  to  the 
buses  through  isolating  circuits.  Replacement  of  units  is  implemented  by 
power  switching  of  the  unit. 

4.  Fault  detection,  recovery,  and  replacement  are  carried  out  by  the 
monitor  (TARP:  test  and  repair  processor). 


5.  Transient  faults  are  identified  by  program  retry.  Repetitious 
errors  are  identified  as  a  permanent  fault  and  eliminated  by  replacement 
of  the  failed  unit. 

6.  The  monitor,  which  is  the  "hard-core"  of  the  system,  is  protected 
by  means  of  autonomous  redundancy,  specifically  by  hybrid-redundancy. 
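
The residue coding of feature 1 can be sketched as follows. The sketch assumes that "byte-wise complement" means the 4-bit complement 15 - (b mod 15), and that the check byte occupies the low-order four bits of the 32-bit word; the placement and all function names are assumptions made for illustration, not details given in the text.

```python
MODULUS = 15
DATA_BITS = 28


def check_byte(b: int) -> int:
    """4-bit complement of the modulo-15 residue of b (a value from 1 to 15)."""
    return 15 - (b % MODULUS)


def encode(b: int) -> int:
    """Append the check byte to a 28-bit number, giving a 32-bit code word."""
    if not 0 <= b < 2**DATA_BITS:
        raise ValueError("b must fit in 28 bits")
    return (b << 4) | check_byte(b)


def is_valid(word: int) -> bool:
    """Check a 32-bit code word without any reference copy of the data."""
    b, c = word >> 4, word & 0xF
    return c == check_byte(b)


def sum_check_byte(ca: int, cb: int) -> int:
    """Predict the check byte of a + b from the operands' check bytes alone."""
    residue = ((15 - ca) + (15 - cb)) % MODULUS
    return 15 - residue
```

Every valid code word is a multiple of 15, so flipping any single bit (a change of plus or minus a power of two) always produces an invalid word. `sum_check_byte` shows in what sense the code is preserved under addition: the check byte of a sum can be predicted independently of the adder that produced the sum.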

5.0  STANDBY  REDUNDANCY  VERSUS  AUTONOMOUS  REDUNDANCY 

The advantages of autonomous redundancy, also known as fault-masking
redundancy, are:

1.  Corrective  action  is  immediate  and  inherent  to  the  structure. 

2.  During  operation  there  is  no  need  for  separate  error  monitoring. 

3. Machine words do not have to be encoded to provide error detection;
consequently problems arising from encoding violations under arithmetic
versus logical versus memory operations do not occur.

4. Implementation of such structures is relatively straightforward
and can be applied to "off-the-shelf" subsystems such as microcomputers.

5.  Coverage,  the  probability  of  detecting  a  failure  given  that  there 
is  a  failure,  is  almost  100%,  and  is  readily  measurable. 

6.  Recovery  strategy  does  not  require  "conditioning"  of  replicas. 

7.  The  "hard-core"  of  the  system  is  relatively  small. 

8.  Internally  truly  real-time  fault-tolerance  and  continuous  system 
availability,  since  no  program  rollback  and  retry  procedures  are  required. 

The advantages of standby redundancy, by contrast, are:

1. Power is required by only one replica at any time.

2. All replicas can be utilized.

3. The number of replicas required can be easily tailored to a mission.

4. No increased "fan-out" problems arise.

5. System checkout is straightforward.

6. No synchronization between replicas is required.

7. System is less susceptible to externally induced transients.

6.0 PROTECTIVE ARCHITECTURE FOR THE "HARD-CORE"

The traditional means of protecting the "hard-core" of predominantly
standby-spares redundant systems, namely the system's monitor and
reconfiguration unit, is by means of TMR, with the voter of the TMR
protected by component redundancy. It should be noted that the technique
of quadding at the component level can be applied even to wires and
connections. For example, the connection:

[figure not reproduced: single connection]

can be replaced by

[figure not reproduced: quadded connection]

thus providing greater fault-tolerance. In this illustration the
connection is provided by 1-out-of-4 redundancy: all 4 connections need
to be impaired to violate the connection, and at least two of the wires
in any set of wires need to open in order to violate continuity.

A more viable alternative to TMR is the class of redundancy known as
hybrid redundancy. Hybrid redundancy combines the best features of both
the autonomous TMR system and the more flexible standby-spares system.
The principles of system operation are as follows: the replicated unit
outputs are compared against the majority-restored system output by
means of the disagreement detectors (exclusive-or gates). The
disagreement detector signals the switching unit to replace the unit
that disagrees with the majority. Thus the units in majority redundancy
are, upon failure, continuously replaced by the spares until all the
spares are used up, at which instant the hybrid system reduces to the
conventional majority voted system (NMR in general).
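
The operating principle just described (vote, detect disagreement by exclusive-or, switch in a spare) can be sketched behaviorally. The model below assumes a TMR core, a perfect voter and switch, and spares taken in first-available order; the class name and interface are hypothetical.

```python
from collections import deque


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (b & c) | (a & c)


class HybridTMR:
    def __init__(self, n_spares: int) -> None:
        self.active = [0, 1, 2]                        # module ids in the voting core
        self.spares = deque(range(3, 3 + n_spares))    # unused spare module ids

    def step(self, outputs: list[int]) -> int:
        """Vote on the active modules' outputs, then replace any dissenter.

        `outputs[m]` is the word delivered by module m in this cycle.
        """
        voted = majority(*(outputs[m] for m in self.active))
        for position, module in enumerate(self.active):
            disagrees = (outputs[module] ^ voted) != 0  # disagreement detector (XOR)
            if disagrees and self.spares:
                self.active[position] = self.spares.popleft()
        return voted
```

Once the spare pool is empty, a dissenting module can no longer be replaced and the structure behaves as plain TMR, as stated in the text.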

In the next section we will consider the extra hardware required to
implement the voter-disagreement-detector-switching-unit (V-D-S unit)
for the TMR system.

6.1 IMPLEMENTATION OF THE V-D-S UNIT


A general implementation scheme utilizing iterative cell arrays for
a V-D-S unit will now be described. First, two switching strategies are
identified: one is sequential and the other rotary. In the first, the
spares are ordered and are utilized, whenever a failure is detected, in
a first-available-first-used manner. In the rotary switching strategy,
spares retain their ordering even after being switched in to replace
failed units. Thus, if the first spare (S1) initially replaced the third
majority replica (M3) and subsequently the first majority replica (M1)
were to fail, then S1 would rotate up to position M1 and a spare S2
would be called to fill the vacancy at M3. The analysis of the switch
state requirements to implement these two strategies shows that if more
than one spare is used, the rotary scheme requires fewer switch states
(e.g., for 3 spares the rotary switch requires 20 states in contrast to
the 34 required for sequential switching).

The basic characteristics of an iterative cell implementation for a
rotary switch are shown in Figures 6 and 7. An iterative cell array is
a series of identical combinational logic cells, each of which receives
inputs (i) from outside the iterative cell network and (ii) from the
cell immediately to its left via intercell leads. Each cell computes an
output function that it transmits (a) to the switching network and (b)
as a new intercell input to the cell immediately to its right. Each cell
generates an output V_ij assigning a module i to a voter position j.
The output is generated as a function of (i) whether its corresponding
module has disagreed or not and (ii) the number of prior modules found
to be functional. The cell state and output as a function of previous
state and disagreement input are shown for a typical cell in Figure 7.
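
The way an intercell signal ripples through such an array can be shown with a deliberately simplified sketch. It does not reproduce the state table of Figure 7 or the rotary ordering: it assumes only that each cell receives its module's disagreement flag and the count of functional modules to its left, and that it assigns its module to the next free voter position. All names are hypothetical.

```python
from typing import Optional


def cell(prior_good: int, disagreed: bool, positions: int = 3) -> tuple[int, Optional[int]]:
    """One combinational cell.

    Returns (count passed to the right-hand neighbour, voter position or None).
    """
    if disagreed or prior_good >= positions:
        return prior_good, None          # failed module, or all voter positions filled
    return prior_good + 1, prior_good + 1


def assign_modules(disagreements: list[bool], positions: int = 3) -> dict[int, int]:
    """Ripple the intercell count left to right; map voter position j -> module index i."""
    assignment: dict[int, int] = {}
    carry = 0  # intercell lead entering the leftmost cell
    for module, disagreed in enumerate(disagreements):
        carry, position = cell(carry, disagreed, positions)
        if position is not None:
            assignment[position] = module
    return assignment


if __name__ == "__main__":
    # Five modules (three core units plus two spares); module 1 has disagreed.
    print(assign_modules([False, True, False, False, False]))
```

Because every cell is identical and depends only on its left neighbour, the array extends to any number of spares by adding cells.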

7.0 RECENT TRENDS IN FAULT-TOLERANT ARCHITECTURES

In the preceding sections we have discussed a number of HIFT
architectures. One thing common to all of them is that redundancy has
been applied at the intracomputer level rather than at the intercomputer
level.

Figure 7. State and output table for an iterative cell.

There are three primary factors motivating the transition from
intracomputer organizations to intercomputer configurations. First, the
extreme miniaturization of digital logic makes available complete CPU
and memory units on one chip, which may have from 40 to 200 pins. If
LSIs are custom specified then the state of the art can allow many tens
of thousands of gates per chip. This availability of subsystem
performance capability at the level of the traditional discrete
component motivates consideration of intercomputer level architectures.

Another factor is that the demand for and volume production of LSI
chips has drastically lowered hardware costs. Today's hardware designer
equates the cost of an LSI CPU chip to the cost, seven or eight years
ago, of just one IC flip-flop. Thus the economics allow the designer to
readily think in terms of fault-tolerant networks of mini- or
microcomputers.

The third factor has been the increased reliability of semiconductors.
Manufacturers have even started giving lifetime warranties for many
devices. Of course, the failure rates of the latest LSI chips are not
available to any level of confidence. It is a truism that by the time
failure rates are available for any components to a satisfactory level
of confidence, the components in question are obsolete! However, one can
extrapolate the increase in reliability in going from MSI to LSI by
knowing the increase that was derived in going from ICs to MSI or from
discrete transistors to ICs. This increase in reliability at the basic
building block level is a factor in considering bigger structures and in
considering fault-tolerance at higher hierarchical levels.

The foregoing does not mean that protective redundancy and
fault-tolerance are not applied at the intrachip level or that they are
not required at that level. On the contrary, as the chip becomes more
complex the semiconductor manufacturing tolerances become stringent and
result in poor yields of chips that can meet design specifications
adequately. In fact, one of the ways of achieving greater yields is to
provide logic redundancy in the chip itself. The techniques of quadding
mentioned earlier are readily applicable to this use. However, since the
semiconductor industry is a highly competitive commercial enterprise,
the actual techniques employed by various manufacturers to achieve the
"black-chip" specification (analogous to "black-box" specification) are
closely guarded secrets. It is well known, and can be observed under a
microscope, that more gates are etched or deposited on a wafer than are
absolutely required; bonding of leads to the wafer can then be
selectively performed to the best gates on the chip to meet the overall
black-chip specification.

The factors outlined above make it feasible, both economically and
engineeringwise, to design intercomputer fault-tolerant architectures.
One such recent effort is the Software Implemented Fault-Tolerant (SIFT)
computer proposed by Wensley for an avionics application [34].

7.1 THE SIFT COMPUTER

The SIFT architecture essentially embodies the TMR or NMR concepts
in software. Majority voting is performed on the results of task
segments of a job by software comparisons and decision making. Thus, the
software majority voting is not at the hardware logic level; its finest
resolution would be at the level of a single instruction (execution).
This, in the limit, could approach logic level resolution, provided that
the machines are microprogrammed and allow task breakdowns to the
microinstruction level. This operation is equivalent to partitioning in
HIFT systems. Perhaps at the microinstruction level the SIFT system
should be called a "Micro-program Implemented Fault-tolerant" (MIFT)
system.

The integrity of the SIFT system is protected against a failed and
misbehaving processor by allowing processors only to read from other
processors' memories and never to write into them. Thus, essentially
each memory unit has two sets of ports: (i) one that allows its
companion processor to read and write into it, and (ii) a second set
that interfaces to the interprocessor bus and allows other processors
only to read information. To enable processors to communicate with one
another, a processor has to write a note into its companion memory;
subsequently, at some time, the other processors "look in" to see if
there are any messages lying around for them.

One advantage of this principle of organization is that the need for
synchronized operation of replicated units is avoided. Each processor,
at the termination of a task, waits until the other processors assigned
by the system executive dispatcher have completed the replicated
software task step. The processors then read from each other the results
of their tasks and effectively perform a majority agreement selection on
the results. The disagreeing (faulty) replica need not necessarily be
removed but can either be ignored or assigned to "void" tasks, i.e.,
tasks having no overall effect.
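
The majority agreement selection performed in software can be sketched as below. This is an illustration of the idea only, not the SIFT executive: it assumes task results are hashable values already read from each processor's memory, and the function name and the processor labels are hypothetical.

```python
from collections import Counter
from typing import Hashable


def vote(results: dict[str, Hashable]) -> tuple[Hashable, list[str]]:
    """Select the majority result among replicated task outputs.

    `results` maps a processor name to the value it posted in its own memory.
    Returns (agreed value, sorted list of dissenting processors).
    """
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    dissenters = sorted(p for p, v in results.items() if v != value)
    return value, dissenters


if __name__ == "__main__":
    value, faulty = vote({"p1": 42, "p2": 42, "p3": 41})
    print(value, faulty)
```

The list of dissenters corresponds to the replicas that the text says may be ignored or assigned to "void" tasks.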

No special hardware, except for the interfacing requirements, is
needed. The replicated buses are also considered as functional units or
processors which have to be addressed and a link established.
Subsequently this bus links up to the addressed processor memory and
completes the "handshaking" between (i) the requesting processor, (ii)
the available bus, and (iii) the addressed memory module. After a
quantum of time the bus releases the "hand" and is available to shake
another requesting processor's hand. Analogous to the hardware
implemented TMR and NMR systems, the SIFT configuration can equivalently
vary the degree of redundancy by allocating more than three processors
to a job. Also, if the various tasks comprising a job have varying
degrees of importance, then, equivalent to the HIFT approach, a
fault-tolerance balancing action can be performed by allocating
different numbers of replicas to the different tasks in the job.

The same approach that is used to segment jobs and allocate tasks is
also used to protect the system executive. The executive system has two
components: (i) a local executive, which resides in every
processor-memory module and is responsible for such functions as
initiating new tasks (dispatching), reporting errors, and loading tasks,
and (ii) a system executive, which resides in at least three modules
(triplication) and has the functions of resource allocation, scheduling
of work load, and system reconfiguration. Each processor knows which
processors have been designated as the system executives. If one of the
system executive processors fails, then the other two will ignore the
failed unit, assign another processor as system executive, and inform
all processors about this "ouster from the system executive troika."

7.2 THE PRIME COMPUTER

Another illustration of recent trends in fault-tolerant intercomputer
architectures is the PRIME project at the University of California,
Berkeley. Here the objective is more to provide a fail-softly capability
(quasi-fault-tolerance). The user is provided continuous service, but at
reduced levels of performance upon the occurrence of failures. The PRIME
architecture uses off-the-shelf microprogrammable minicomputers (Digital
Scientific Corporation META-4 microprocessors) to implement a
multiprocessing time-sharing system with extensive secondary storage
capabilities consisting of disk drives. Intercommunication between any
processors, between any processor and any disk, or between any processor
and any external device is implemented by means of an interconnection
network (IN), which is a distributed network. This network is
partitioned and powered such that failures are always isolated to a very
small part of the network. A failure in the IN is equivalent to a
failure of the unit attached to the IN node that fails. Hence, this
allows the system to run with an arbitrary variable set of the various
units. The reader is referred to the PRIME literature for a detailed
description of the total multiprocessor timesharing system [35-39].

The PRIME system does raise a very pertinent question, namely: can
multiprocessing systems be considered to be in the same family lineage as
fault-tolerant systems? One answer is that since quasi-fault-tolerance is
the next lower hierarchical level (because it permits graceful performance
degradation), and since software-implemented fault-tolerant systems, as we
have seen in some detail, are intrinsically multiprocessing systems (though
operating on replicated tasks), it follows that the distinction rests on the
type of operating system (OS) that the multiprocessor operates under: that
is, on whether the OS implements software fault-tolerance, as in the SIFT
proposal, or software quasi-fault-tolerance, as in PRIME.

Thus  we  note  that  once  the  hardware  is  provided  for  "universal"  connec¬ 
tivity  and  failure  isolation,  the  fault-tolerance  capability  of  these  inter¬ 
computer  configurations  resides  primarily  in  the  software  operating  systems 
driving  and  managing  the  hardware. 

Thus, although the HIFT architectures are here to stay, and will always
be useful in selective applications such as avionics and aerospace, for the
more general ground-based commercial applications, such as computer-utility
type operations, we will see greater proliferation of software-implemented
fault-tolerant and quasi-fault-tolerant systems which will utilize not
handcrafted processors, but commercially available off-the-shelf mini- and
microcomputers.

8.0  AUTOMATION  OF  RELIABILITY  MEASUREMENT  PROCESSES 

The large number of different redundancy schemes available to the
designer of fault-tolerant systems, the number of parameters pertaining to
each scheme, and the large range of possible variations in each parameter
call for automated procedures that enable the designer to rapidly model,
simulate and analyze preliminary designs and, through man-machine symbiosis,
arrive at optimal and balanced fault-tolerant systems under the constraints
of the prospective application.

Such an automated procedural tool, which can model self-repair and
fault-tolerant organizations, compute reliability-theoretic functions,
perform sensitivity analysis, compare competitive systems with respect to
various measures, and facilitate report preparation by generating tables and
graphs, is implemented in the form of an on-line interactive computer program
called CARE (for Computer-Aided Reliability Estimation) [40]. Essentially,
CARE consists of a repository of mathematical equations defining the various
basic redundancy schemes. These equations, under program control, are then
interrelated to generate the desired mathematical model to fit the
architecture of the system under evaluation. The mathematical model is then
supplied with ground instances of its variables and evaluated to generate
values for the reliability-theoretic functions applied to the model.

The  mathematical  models  may  be  evaluated  as  a  function  of  absolute  mis¬ 
sion  time,  normalized  mission  time,  non-redundant  system  reliability,  or 
any  other  system  parameter  that  may  be  applicable. 
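
As a small illustration of evaluating a model against normalized mission
time, the sketch below compares a simplex unit with a TMR system. It assumes
the exponential failure law, R = exp(-lambda*T), and the standard
two-out-of-three expression 3R^2 - 2R^3 with a perfect voter; the function
names are invented for the example and are not CARE routines.

```python
import math


def simplex_reliability(normalized_time):
    """R = exp(-lambda*T) for a non-redundant unit."""
    return math.exp(-normalized_time)


def tmr_reliability(r):
    """Two-out-of-three majority system with a perfect voter (assumption)."""
    return 3 * r**2 - 2 * r**3


for lam_t in (0.1, 0.5, math.log(2), 1.0):
    r = simplex_reliability(lam_t)
    print(f"lambda*T = {lam_t:.3f}   R = {r:.4f}   R(TMR) = {tmr_reliability(r):.4f}")
```

Under these assumptions TMR improves on simplex only while R exceeds 0.5,
that is, for normalized mission times shorter than ln 2.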


λ         =  Powered failure rate
μ         =  Unpowered failure rate
K = λ/μ   =  Dormancy factor
T         =  Mission time
λT        =  Normalized mission time
R         =  Simplex reliability
R_S       =  Dormant reliability, exp(-μT)
S         =  Number of spares
n         =  (N-1)/2, where N is the total number of multiplexed units
Q         =  Quota, or number of identical units in simplex systems
C         =  Coverage factor, Pr(recovery | failure)
RV        =  Reliability of restoring organ or switching overhead
Z         =  Number of identical systems in series
W         =  Number of cascaded or partitioned units
P         =  Probability of unit failing to "zero"
TMR       =  Triple modular redundancy
TMRp      =  TMR system with probabilistic compensating failures
(1,S)     =  Standby spare system
(N,S)     =  Hybrid redundant system
(3,S)sim  =  Hybrid/simplex redundant system
MTF       =  Mean life
R(MTF)    =  Reliability at the mean life

Table III. Table of Abbreviations and Terms

8.1  UNIFYING  NOTATION

A unifying notation, developed to describe the various system
configurations using selective, massive or hybrid redundancy, is illustrated
in Figure 8.

N refers to the number of replicas that are made massively redundant
(NMR); S is the number of spare units; W refers to the number of cascaded
units, i.e., the degree of partitioning; R() refers to the reliability of
the system as characterized in the parentheses; TMR stands for triple
modular redundant system (N = 3); and NMR stands for N-tuple modular
redundancy.

A hybrid redundant system H(N,S,W) is said to have a reliability
R(N,S,W). If the number of spares is S = 0, then the hybrid system reduces
to a cascaded NMR system whose reliability expression is denoted by R(N,0,W);
in the case where there are no cascades, it reduces to R(N,0,1), or more
simply R(NMR). Thus the term W may be elided if W = 1. The sparing system
R(1,S) consists of one basic unit with S spares.

Furthermore, the convention is used that R* indicates that the
unreliability (1 - RV) due to the overhead required for restoration,
detection, or switching has been taken into account, e.g.,
R*(NMR) = RV · R(NMR); if the asterisk is elided then it is assumed that the
overhead has a negligible probability of failure. This proposed notation is
extendable and can incorporate a number of functional parameters in addition
to those shown here by enlarging the vector or list of parameters within the
parentheses, e.g., R(N,S,W,...,X,Y,Z).
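
To make the notation concrete, the following sketch evaluates R*(NMR) and a
cascaded R*(N,0,W). It is an illustration under stated assumptions, not
CARE's own equation set: the majority expression is the standard binomial
sum with n = (N-1)/2 tolerated failures, the W partitions are assumed to be
equal so that each has simplex reliability R**(1/W), and every stage is
assumed to carry its own restoring organ of reliability RV. The function
names are hypothetical.

```python
from math import comb


def r_nmr(r, n_units=3, rv=1.0):
    """R*(NMR) = RV * sum_{i=0..n} C(N,i) (1-R)^i R^(N-i), with n = (N-1)/2."""
    if n_units % 2 == 0:
        raise ValueError("N must be odd for majority voting")
    n = (n_units - 1) // 2
    core = sum(
        comb(n_units, i) * (1 - r) ** i * r ** (n_units - i) for i in range(n + 1)
    )
    return rv * core


def r_cascaded(r, n_units=3, w=1, rv=1.0):
    """R*(N,0,W), assuming W equal partitions, each voted separately."""
    return r_nmr(r ** (1.0 / w), n_units, rv) ** w


print(r_nmr(0.9))               # R(TMR) for R = 0.9
print(r_nmr(0.9, rv=0.99))      # R*(TMR) with an imperfect voter
print(r_cascaded(0.9, w=4))     # R(3,0,4): finer partitioning, perfect voters
```

With perfect voters, partitioning raises the reliability; with RV < 1 the
gain is offset by the W voters placed in series, which is the trade-off the
starred notation is meant to expose.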

8.2  EXISTING  RELIABILITY  PROGRAMS 

Some representative reliability evaluation programs are RCP, RELAN,
and REL70. RCP [41,42] is a program which can model a network of arbitrary
series-parallel combinations of building blocks and analyzes the system
reliability by means of probabilistic fault-trees. RELAN [43] is an
interactive program developed by TIME/WARE and is offered on the Computer
Sciences Corporation's INFONET network. RELAN, like RCP, models arbitrary
series-parallel combinations but in addition allows a wide choice (any of 17
types) of failure distributions. RELAN has concise and easy-to-use input
formats and provides elegant outputs such as plots and histograms. REL70
[60]  and  its  forerunner  REL  [61]  are  interactive  programs  developed 
in  APL/360.  Unlike  RCP  and  RELAN,  REL70  is  more  adapted  for  evaluating 
systems  other  than  series-parallel,  such  as  standby-replacement  and  triple 
modular  redundancy.  It  offers  a  large  number  of  system  parameters,  in  par¬ 
ticular,  C, the  coverage  factor  defined  as  the  probability  of  recovering  from 
a  failure  given  that  the  failure  exists  and,  Q,  the  quota,  which  is  the  num¬ 
ber  of  modules  of  the  same  type  required  to  be  operating  concurrently.  REL70 
is primarily oriented toward the exponential distribution, though it does
provide limited capabilities for evaluating reliability with respect to
the Weibull distribution; its outputs are primarily tabular. Since APL is
an interpretive language, REL is slow in operation; however, its designers
have overcome the speed limitation by programming not the explicit
reliability equations but approximate versions, applicable to short
missions, that use the approximation (1 - exp(-λT)) ≈ λT for small values
of λT.
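
The quality of this short-mission approximation is easy to check
numerically. The sketch below is an illustration only (the function names
are not taken from REL or REL70); it compares the exact unreliability
1 - exp(-lambda*T) with the linear term lambda*T.

```python
import math


def unreliability_exact(lam_t):
    """Exact unreliability under the exponential failure law."""
    return 1.0 - math.exp(-lam_t)


def relative_error(lam_t):
    """Relative error made by replacing the exact value with lam_t."""
    exact = unreliability_exact(lam_t)
    return (lam_t - exact) / exact


for lam_t in (0.001, 0.01, 0.1, 1.0):
    print(f"lambda*T = {lam_t:<6} relative error = {relative_error(lam_t):.4f}")
```

The error grows roughly as lambda*T/2, so the shortcut is reasonable only
when the normalized mission time is small.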

The CARE program is a general program for evaluating fault-tolerant
systems, general in that its reliability-theoretic functions do not pertain
to any one system or equation but to all equations contained in its
repository and also to complex equations which may be formed by
interrelating the basic equations. This repository of equations is
extendable. Dummy routines are
provided  wherein  new  or  more  general  equations  may  be  placed  as  they  are  de¬ 
veloped  and  become  available  to  the  fault-tolerant  computing  community.  For 
example,  the  equation  developed  by  Bouricius  et  al.,  for  standby-replacement 
systems  embodying  the  parameters  C  and  Q  has  been  bodily  incorporated  into 
the  equation  repository  of  CARE. 

8.3  CARE’S  REPOSITORY  OF  EQUATIONS 


The equations residing in CARE, based on the exponential failure law,
model the following basic fault-tolerant organizations:

(1)  Hybrid-redundant (N,S) systems
     (a)  NMR (N,0) systems
     (b)  TMR (3,0) systems
     (c)  Cascaded or partitioned versions of the above systems
     (d)  Series string of the above systems.

(2)  Standby-sparing redundant (1,S) systems
     (a)  K-out-of-N systems
     (b)  Simplex systems
     (c)  Series string and cascaded versions of the above.

(3)  TMR systems with probabilistic compensating failures.
     Series string and cascaded versions of the above.

(4)  Hybrid/simplex redundant (3,S)sim systems.
     (a)  TMR/simplex systems
     (b)  Series string and cascaded versions of the above.

The  equations  for  each  of  these  systems  are  the  most  general  representation 
of  their  systems,  parameterizing  mission  time,  failure  rates,  dormancy  fac¬ 
tors,  coverage,  number  of  spares,  number  of  multiplexed  units,  number  of 
cascaded  units, and  number  of  identical  systems  in  series.  The  definitions 
of  these  parameters  reside  in  CARE  and  may  be  optionally  requested  by  the 
user.  More  complex  systems  may  be  modeled  by  taking  any  of  the  above  listed 
systems  in  series  reliability  with  one  another. 
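
Series composition is simply a product of subsystem reliabilities, assuming
the subsystems fail independently. The sketch below illustrates this with
hypothetical helper names; it is not a transcription of CARE's routines.

```python
import math


def series_reliability(subsystem_reliabilities):
    """Reliability of independent subsystems that must all survive."""
    return math.prod(subsystem_reliabilities)


def identical_series(r, z):
    """Z identical systems in series: R**Z."""
    return r**z


# Example: a TMR stage (0.972) in series with a standby-spared stage (0.99).
print(series_reliability([0.972, 0.99]))
print(identical_series(0.972, 3))
```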

PART  II


SELF-CHECKING  CIRCUITS 


1.0  INTRODUCTION 

In 1968 Carter and Schneider [4] defined a self-checking circuit to
be a circuit whose output is encoded in an error-detecting code. Anderson
[2,3] further defined such circuits as having the properties of
self-testing and fault-secureness. Wakerly [6] introduced the concept of
partially self-checking circuits.

Researchers  have  expanded  upon  these  concepts  and  designed  self¬ 
checking  checkers  and  entire  computers  based  on  these  properties.  However, 
it  is  unfortunate  that  the  initial  historic  definition  by  Carter  and 
Schneider  is  rather  narrow  in  that  only  circuits  whose  outputs  are  encoded 
in  error  detecting  codes  are  considered  to  be  self-checking  circuits.  As 
seen  in  the  section  on  redundancy,  hybrid  and  other  system  variants  can  also 
be  self-checking  without  resorting  to  any  error-detecting  codes  whatsoever. 

This section on self-checking circuits presents an introduction to
the fundamental underlying concepts. It adheres closely to the literature
and is specifically indebted to the notation developed by Wakerly [1].

Basic  concepts  of  code  space  and  detectable  errors  are  explained  and 
examples  presented.  Then  the  notions  of  fault-secureness  and  self-testing 
are  introduced.  Self-testing,  partially  self-testing,  and  totally  self¬ 
testing  circuits  are  then  described. 

Totally  self-checking  networks  are  then  defined  and  an  introduction 
to  morphic  Boolean  logic  is  presented  as  a  systematic  methodology  to  the 
design  of  totally  self-checking  networks. 


2.0  BASIC  CONCEPTS  OF  CODE  SPACE  AND  DETECTABLE  ERRORS 

Let U be the universe of all vectors of length n (n-tuples); then the
subset S (called the code space) is an error-detecting code if the vectors
in S are chosen such that every fault of interest affecting vectors in S
(each vector in S is called a code word) will produce vectors that are not
in S, i.e., in U-S (each vector in the noncode space U-S is called a noncode
word). If a failure alters a code word x into another n-tuple x' then:

(i)   if x' is also in S (the code space)
      then it is an undetectable error;

(ii)  if x' is in U-S (the noncode space, written S̄)
      then it is a detectable error.

Hence for the effect of a fault to be detectable it must produce an error
such that some code word (in code space) gets mapped onto some noncode word
(in noncode space). For a code to be able to detect the set of failures of
interest the code space S must be chosen so as to have this mapping property.
These basic concepts are illustrated in Figure 1.

Example 1: Encode all 2-bit words (x_2 x_1) with even parity (P_e). An
encoded message (code word) is then of length 3.

All binary 3-tuples:

     U  =  0 0 0
           0 0 1
           0 1 0
           0 1 1
           1 0 0
           1 0 1
           1 1 0
           1 1 1

All uncoded messages:

           x_2 x_1
            0   0
            0   1
            1   0
            1   1

All code words (with even parity):

     S  =  x_2 x_1 P_e
            0   0   0
            0   1   1
            1   0   1
            1   1   0

All noncode words:

   U-S  =  0 0 1
           0 1 0
           1 0 0
           1 1 1


In Figure 2 we show how this simple form of parity encoding can be
used in a self-checking circuit. The inputs x_1 and x_2 are used to drive
a parity generator which generates P_e. The encoded data is then
transmitted over a channel. The received data, consisting of x_1, x_2 and
P_e, is then again processed by a checker, and an odd parity (E = 1)
indicates an error. Figure 3 indicates the possible I/O mappings.
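
The generator and checker of this example can be sketched directly. The code
below is an illustration of Example 1 only; the function names are invented
here, and tuples are ordered (x_2, x_1, P_e) as in the code-word table.

```python
from itertools import product


def even_parity_bit(x2, x1):
    """Parity generator: P_e makes the number of 1s in the code word even."""
    return x2 ^ x1


def encode(x2, x1):
    """Form the 3-bit code word (x_2, x_1, P_e)."""
    return (x2, x1, even_parity_bit(x2, x1))


def checker(word):
    """Return E: 1 if the received word has odd parity (error), else 0."""
    x2, x1, pe = word
    return x2 ^ x1 ^ pe


universe = set(product((0, 1), repeat=3))
code_space = {encode(x2, x1) for x2, x1 in product((0, 1), repeat=2)}
noncode_space = universe - code_space

print(sorted(code_space))     # the four even-parity words
print(sorted(noncode_space))  # the four odd-parity words
```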

In general, for all f_i in the fault set the input code word could get
mapped onto any 3-tuple in U. If f_i is such that an input is mapped into S
then f_i is undetectable (E = 0). If f_i is such that an input is mapped
onto U-S, then f_i is detectable (E = 1). Specifically, for the set of
faults {f_i} such that only an odd number of bits in the input code word are
altered, the corresponding output words will be in the noncode space U-S;
hence {f_i} will be detectable, indicated by E = 1.
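
This odd/even behaviour can be confirmed exhaustively for the 3-bit
even-parity code. In the sketch below a fault is modelled, as an assumption
for illustration, by an error pattern that is XORed onto the transmitted
code word; the helper names are hypothetical.

```python
from itertools import product

# Even-parity code space S of Example 1.
CODE_SPACE = [w for w in product((0, 1), repeat=3) if sum(w) % 2 == 0]


def apply_error(word, pattern):
    """Flip the bits of `word` wherever `pattern` has a 1."""
    return tuple(b ^ e for b, e in zip(word, pattern))


def is_detected(word):
    """Checker output E as a boolean: True when the parity is odd."""
    return sum(word) % 2 == 1


def detection_table():
    """Map each nonzero error pattern to whether it is always detected."""
    table = {}
    for pattern in product((0, 1), repeat=3):
        if any(pattern):
            table[pattern] = all(
                is_detected(apply_error(w, pattern)) for w in CODE_SPACE
            )
    return table


for pattern, detected in sorted(detection_table().items()):
    print(pattern, "detectable" if detected else "undetectable")
```

Patterns of weight one or three are reported as detectable and patterns of
weight two as undetectable, in agreement with the text.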

We  will  now  examine  and  classify  different  fault  sets  by  their  corres¬ 
ponding  mapping  properties,  and  also  consider  input  possibilities  other  than 
code  words.  For  total  generality  we  need  to  consider: 


(i)  the  universe  of  inputs 

(ii)  the  universe  of  outputs 

(iii)  the  universe  of  faults 


and the set of transfer functions {T_i} associated with all possible
fault-induced mappings onto the code space S and noncode space S̄. The
absence of a fault, the null fault, will be denoted by λ. Thus,

   fault-free, λ:   input x  -->  T(input, λ)    -->  output x', in code space S

   fault, f_i:      input x  -->  T(input, f_i)  -->  output x_i

For  the  fault-free  case  the  output  will  always  be  in  the  code  space  S  hence 
the  checker  will  correctly  indicate  absence  of  any  error. 

In general, for the faulty case the output x_i may be anywhere in the
output space U. Four cases can be identified:

Case I.    Fault-free. Transfer function is T(input, λ).
           Output is correct. Output is in code space, S.
           No error is indicated.

Case II.   Benign Fault. Transfer function is T(input, f_b).
           Output is correct. Output is in code space, S.
           No error is indicated.

Case III.  Detectable Fault. Transfer function is T(input, f_d).
           Output is incorrect. Output is in noncode space, S̄.
           Error is indicated.

Case IV.   Undetectable Fault. Transfer function is T(input, f_d̄).
           Output is incorrect. Output is in code space, S.
           Error is not indicated.


We  summarize  and  illustrate  these  definitions  in  Figure  4  and  Table  I. 

One  of  the  goals  of  fault-tolerant  design  is  to  reduce  the  conditions 
under  which  Case  IV  can  occur. 
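
The four cases amount to a simple decision procedure on the observed output.
The sketch below is illustrative only; the function name and the string
labels are assumptions, and the code space used in the example is the
even-parity code of Example 1.

```python
def classify(fault_present, correct_output, actual_output, code_space):
    """Classify one (input, fault) combination into Cases I to IV."""
    if not fault_present:
        return "I: fault-free"
    if actual_output == correct_output:
        return "II: benign fault"
    if actual_output not in code_space:
        return "III: detectable fault"
    return "IV: undetectable fault"


S = {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}
good = (0, 1, 1)
print(classify(False, good, good, S))
print(classify(True, good, good, S))
print(classify(True, good, (0, 0, 1), S))   # one bit altered
print(classify(True, good, (1, 0, 1), S))   # two bits altered
```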


3.0  FAULT-SECURE  CIRCUITS 

If for a given design only the transfer functions T(x, λ), T(x, f_b) and
T(x, f_d), i.e., Cases I, II and III, are possible, then the circuit is
called a fault-secure circuit. (An alternative statement would be that
T(x, f) ∈ S implies that f = f_b or λ.) Figure 5 indicates the acceptable
transfer functions for a fault-secure circuit. When inputs are from the
secure input set, N_s, and the fault set is the secure fault set, F_s, the
fault-secureness property guarantees that no fault from the fault set will
produce an undetectable incorrect output. It should be noted that in the
above the set {f_d} may be empty, i.e., all faults may be of the type f_b
and hence undetectable. It is important to stress that, in general, a
circuit is not fault-secure with respect to all possible faults, but rather
with respect to a given set or class of faults, e.g., all single stuck-at
faults.

[Figure: fault space; four possible fault classifications: fault-free,
benign, detectable, undetectable]
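
Fault-secureness with respect to a given input set and fault set can be
checked by brute force on small examples. The sketch below makes several
assumptions for illustration: the "circuit" is just the three lines of the
parity-encoded channel of Example 1, a fault is a tuple of (line, stuck
value) pairs, and all names are hypothetical.

```python
from itertools import combinations, product

CODE_SPACE = {w for w in product((0, 1), repeat=3) if sum(w) % 2 == 0}


def transfer(word, fault=()):
    """T(x, f): force each (line, value) pair in `fault` onto the word."""
    out = list(word)
    for line, value in fault:
        out[line] = value
    return tuple(out)


def is_fault_secure(secure_inputs, fault_set):
    """True if no fault ever yields an incorrect output that lies in S."""
    for x in secure_inputs:
        good = transfer(x)
        for f in fault_set:
            out = transfer(x, f)
            if out != good and out in CODE_SPACE:
                return False
    return True


SINGLE_STUCK_AT = [((line, v),) for line in range(3) for v in (0, 1)]
DOUBLE_STUCK_AT = [
    ((a, va), (b, vb))
    for a, b in combinations(range(3), 2)
    for va, vb in product((0, 1), repeat=2)
]

print(is_fault_secure(CODE_SPACE, SINGLE_STUCK_AT))
print(is_fault_secure(CODE_SPACE, DOUBLE_STUCK_AT))
```

The same lines are fault-secure for single stuck-at faults but not for
double ones, which illustrates why the property is always stated relative
to a fault class.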


4.0  SELF-TESTING  CIRCUITS 

If for every fault, f_i, in the set of faults (which are under
consideration) there exists some input, say x_j (x_j is called a test for
f_i), such that T(x_j, f_i) belongs to the noncode space S̄, then the circuit
is called a self-testing circuit. Thus a self-testing circuit is one for
which every fault is detectable by applying some input; the input is called
a test for that fault. The set of f_i's is called the tested fault set. As
shown in Figure 6, for some input x_k, f_i may be undetectable, but for some
other input x_j, f_i is detectable.

The input set for self-testing circuits is called the normal input set,
N, and every input should occur during normal operation in order to detect
the presence of a fault from the tested fault set.

The secure input set, N_s, may or may not be a subset of N, but for all
practical purposes can be assumed to be a subset of N since those inputs
outside of N would never occur in normal operation (by definition).

Similarly the secure fault set, F_s, is assumed to be a subset of F_t,
the tested fault set. The interrelationship and combined effect of the two
properties of fault-secureness and self-testing are shown in Figure 7.

Now we can look at the relationship between N_s, the secure input set,
and N, the normal input set. Three cases are of interest:

Case (i):    If N_s = N then the circuit is called totally self-checking,
             and is both self-testing and fault-secure. (Note: a
             self-checking circuit is defined as any circuit whose output
             is encoded in an error-detecting code.)

Case (ii):   If N_s ⊂ N (a proper non-null subset) then the circuit is
             called partially self-checking.

Case (iii):  If N_s ∩ N = Λ (null) then the circuit is self-testing-only,
             and not fault-secure.

[Figure: self-testing transfer functions; tested fault set]

These definitions are summarized in Table II.


N_s : N      Properties of Self-Checking Circuits          References

N_s = N      Self-testing & fault secure                    Anderson [2,3]
             (totally self-checking)

N_s = Λ      Self-testing & not fault secure                Carter [5]

N_s ⊂ N      Partially self-checking circuit                Wakerly [6]

TABLE II. Properties of Self-Checking Circuits
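
The three rows of Table II reduce to a set comparison. The sketch below
assumes the circuit is already known to be self-testing for the normal
input set N, and uses an invented function name.

```python
def classify_circuit(secure_inputs, normal_inputs):
    """Classify a self-testing circuit by comparing N_s with N."""
    n = set(normal_inputs)
    n_s = set(secure_inputs) & n   # only inputs that occur in normal operation
    if n_s == n:
        return "totally self-checking"
    if n_s:
        return "partially self-checking"
    return "self-testing only"


N = {"000", "011", "101", "110"}
print(classify_circuit(N, N))
print(classify_circuit({"000", "011"}, N))
print(classify_circuit(set(), N))
```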

The principal advantage of partially self-checking circuits is that,
when using some simple codes, they may be used to perform logical operations
and yet preserve proper encoding.


5.0  TOTALLY  SELF-CHECKING  NETWORKS 

In order to have a totally self-checking network the checker must also
be self-checking. A totally self-checking network is one that consists of a
functional circuit and a checker where both the functional circuit and the
checker are totally self-checking.

In Table III we summarize the main results dealing with self-checking
circuits.

REFERENCE                             RESULT

1. Toy [7] and Carter et al. [5]      General design of a self-testing-only
                                      1-out-of-n decoder.

2. Anderson & Metze [3]               General procedure for designing totally
                                      self-checking checkers for k-out-of-2k
                                      codes for all k. Also 1-out-of-n codes
                                      for all n except n = 3 and n = 7.

3. Reddy [9]                          Design of a totally self-checking
                                      1-out-of-7 checker. Conjecture that no
                                      1-out-of-3 checker exists.

4. Ashjaee & Reddy [10]               Conjecture that no totally self-checking
                                      equality checker with only three normal
                                      inputs exists.

5. Shedletsky [11]                    Method for calculating a rollback
                                      interval for use with partially
                                      self-checking circuits.

6. Marouf & Friedman [12]             Algorithmic procedure for efficient
                                      design of general m-out-of-n checkers
                                      for m ≥ 2.

7. Carter et al. [13]                 Design outline of an entire
                                      self-checking computer processor.

8. Gay [14]                           Markov models to compute the optimal
                                      testing strategy for a system with
                                      partial on-line error-detection.

9. Kolupaev [15]                      Method to synthesize desired totally
                                      self-checking network by searching for
                                      an appropriate cascade of smaller
                                      generalized self-checking circuits.
                                      Method works for small networks but
                                      impractical for large nets.

TABLE III. Notable "Self-Checking" Results and Milestones

6.0  MORPHIC  BOOLEAN  FUNCTIONS  AND  THEIR  IMPLEMENTATION  AS 
SELF-CHECKING  CIRCUITS 

It is sometimes desirable to implement a Boolean variable x using a pair
of lines (x_1, x_2). Assume that x = 1 is represented sometimes as x_1 = 1,
x_2 = 0, and at other times as x_1 = 0, x_2 = 1. Similarly, if x = 0 is
represented by both (x_1, x_2) = (0,0) and (1,1), then if either x_1 or x_2
is stuck at some value an error will be produced. We refer to codings of
this type as morphic functions.

If e_1, e_2 ∈ {0,1}, then let the mapping M be defined as follows:

     M:  {(e_1, e_2), (ē_1, ē_2)}  ↦  1
         {(ē_1, e_2), (e_1, ē_2)}  ↦  0

Since morphic logic functions (e.g., AND, NAND, etc.) have to be implemented
using the values taken by the pair of lines, a correspondence must be
established between an ordinary Boolean function g(a_1, a_2, ..., a_n) and
the logic function whose inputs and outputs are pairs of lines. The ordinary
Boolean algebra is defined over the state space {0,1}, i.e.,
B = {0, 1, {*}}, where {*} is the usual set of logical operations. For the
pair of lines we define the operators {*_M} over the state space of the pair
of lines, i.e.,

     { {(e_1, e_2), (ē_1, ē_2)}, {(ē_1, e_2), (e_1, ē_2)}, {*_M} }.

In order to establish a correspondence between the Boolean operator {*} and
the operator {*_M} for the pair of lines, M is defined to be a morphism
(operation-preserving property) between {0, 1, {*}} and
{ {(e_1, e_2), (ē_1, ē_2)}, {(ē_1, e_2), (e_1, ē_2)}, {*_M} }, i.e., between
{*} and {*_M}. If s_i and s_j ∈ {(0,0), (0,1), (1,0), (1,1)} then

     M(s_i *_M s_j) = M(s_i) * M(s_j).

As an example let M:  {01, 10} ↦ 1
                      {00, 11} ↦ 0

The  morphic  AND  function  (MAND)  can  be  defined  as  follows: 

Using two K-maps we can determine how to implement the MAND function. Let the
morphic variables be A_1 and A_2, and let A_1 be implemented as (a_11, a_12)
and A_2 as (a_21, a_22). Then a judicious choice of implicants leads to a
realization of A_1 *_M A_2 as a pair of output lines.

[Figure: K-map realization of the MAND function and the resulting
output-pair expression]

Other functions (∨_M, ⊕_M, ...) can be generated similarly. The
implementation of the MAND function is not unique.
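
The morphism requirement can be checked mechanically for any candidate
realization. The sketch below is not the realization derived from the
K-maps in the original figure; it is one possible MAND, chosen here for
illustration, together with an exhaustive test that M(s_i *_M s_j) equals
M(s_i) AND M(s_j) under the example mapping {01, 10} to 1 and {00, 11} to 0.
All function names are hypothetical.

```python
from itertools import product


def m(pair):
    """Morphic value of a line pair: (0,1),(1,0) -> 1 and (0,0),(1,1) -> 0."""
    return pair[0] ^ pair[1]


def mand(a, b):
    """One possible morphic AND on line pairs (illustrative choice).

    The output pair (c1, c2) satisfies c1 XOR c2 = (a1 XOR a2) AND (b1 XOR b2).
    """
    a1, a2 = a
    b1, b2 = b
    return ((a1 & b1) ^ (a2 & b2), (a1 & b2) ^ (a2 & b1))


def is_morphism(op_m, op):
    """Check M(s op_m t) == M(s) op M(t) for all pairs of line pairs."""
    pairs = list(product((0, 1), repeat=2))
    return all(m(op_m(s, t)) == op(m(s), m(t)) for s in pairs for t in pairs)


print(is_morphism(mand, lambda x, y: x & y))
```

Because several output pairs encode the same Boolean value, many other
realizations pass the same check, which is consistent with the remark that
the implementation is not unique.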

In summary, a correspondence is set up between the ordinary Boolean
function defined over {1,0} and the morphic Boolean function over the state
space

     { {(e_1, e_2), (ē_1, ē_2)}, {(ē_1, e_2), (e_1, ē_2)} }.

The use of self-checking circuitry in the hardcore part of the system
enhances the reliability of the error-handling portion of the system. It
has been shown that morphic Boolean functions can be used as self-checking
operators and can be interconnected to provide self-checking implementations
of logic circuits. An example of a self-checking circuit would be the MAND
function derived earlier. This operator can be used to implement a
self-checking comparator as follows. If d_1, d_2, d_3, and d_4 are four
signal lines, then to ensure reliability of the signal on these lines the
four lines are complemented and the pairs of lines are compared using the
self-checking morphic comparator, as shown in Figure 8.

Figure 8. Self-checking checker


7.0  CONCLUSION 


Part  II  of  this  chapter  has  described  the  basic  concepts  of  self¬ 
checking  circuits.  Self-checking  circuits  were  defined  to  be  circuits 
whose  outputs  are  encoded  in  an  error-detecting  code.  The  underlying 
theory  based  on  code  and  noncode  space  was  developed  with  illustrative 
examples.  The  notions  of  fault-secureness  and  self-testing  were  intro¬ 
duced.  Self-testing,  partially  self-testing,  totally  self-testing  cir¬ 
cuits,  and  totally  self-checking  networks  were  described  and  defined. 

An introduction to morphic Boolean logic as a systematic aid to the
design of totally self-checking networks was also provided. This section
forms an important theoretical basis for the design of totally
self-checking computers.

PART  III


CODING  TECHNIQUES* 


1.0  INTRODUCTION 

Part  III  of  this  chapter  discusses  the  use  of  concurrent  diagnosis  tech¬ 
niques  based  on  codes  in  digital  computing  systems.  More  specifically,  it 
deals  with  a  subset  of  such  concurrent  techniques,  since  massive  redundancy 
methods  such  as  circuit  triplication  and  voting  or  circuit  duplication  (with 
diagnostic  checks  in  case  of  disagreement)  have  already  been  considered  in 
an  earlier  section  (see  Part  I  of  this  chapter).  The  two  types  of  coding 
methods  or  concurrent  diagnosis  considered  in  this  section  are: 

(a)  partial  circuit  duplication  using  a  checking  algorithm 
to  detect  or  correct  errors. 

(b)  use  of  encoded  operands  such  that  errors  can  be  detected 
by  means  of  a  subsequent  checking  algorithm. 

The  distinction  between  the  two  types  of  diagnosis  will  become  more  clear 
when  we  consider  both  separate  and  non-separate  methods  of  coding. 

Concurrent  diagnosis  means  the  "immediate"  and  "local"  checking  or 
correcting  of  information  being  transmitted  or  processed.  The  meaning  of 
"immediate"  and  "local"  may  be  debatable,  but  here  immediate  means  either 
during  a  minor  computation  or  just  after  it;  and  local  will  mean  that  the 
additional  hardware  requirements  are  built  into  the  processor  itself. 

This section is based primarily on the papers by Kautz [1], Avizienis
[2] and Armstrong [3], but additional references are cited where further
information was found in particular areas. A quick survey of the
literature in the field shows a fairly heavy emphasis on the theory of
codes, with little being written about the usage of such codes in practice.
However, the advent of smaller and cheaper logic circuits due to the use
of integrated circuit techniques, and particularly VLSI (very large scale
integration), has made the extra hardware requirements of checking
techniques more feasible. Also, the more complex problem-solving
capabilities of modern computers have made the use of error detection or
correction more necessary. One can readily envision the use of computers in
problems of such size that error-free solutions are otherwise impossible.

* This section is an edited abbreviated version of an unpublished paper
by G. Cole.

The IBM 7030 (STRETCH) computer is a notable example of a design which
utilized extensive error detection and correction capabilities. The problem
requirements for this machine were of such magnitude that both fast and
error-free computation were required. Each memory word had 8 check bits in
addition to the 64 data bits for automatic correction of any single-bit
errors. Other operations were also checked by the use of parity checks,
duplication of computation, and "casting out threes." When errors were
detected, they were corrected and/or the error was recorded on a special
maintenance output device. An estimated 14% of the entire computer was used
solely for checking purposes [15].
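
The figure of 8 check bits for 64 data bits is consistent with a
Hamming-style single-error-correcting code extended by one overall parity
bit. The source does not state how the STRETCH code was constructed, so the
sketch below is only a plausibility check under that assumption: a
single-error-correcting code needs the smallest r with 2**r >= m + r + 1.

```python
def sec_check_bits(data_bits):
    """Smallest r such that r check bits can locate any single-bit error."""
    r = 1
    while 2**r < data_bits + r + 1:
        r += 1
    return r


def sec_ded_check_bits(data_bits):
    """Add one overall parity bit for double-error detection (assumption)."""
    return sec_check_bits(data_bits) + 1


print(sec_check_bits(64))      # check bits for single-error correction
print(sec_ded_check_bits(64))  # with the extra overall parity bit
```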


2.0  TRANSMISSION  CODES 

For ease of presentation, one can divide the subject of error detection
and correction into two groups, namely transmission codes and arithmetic
codes. The transmission codes check for errors during the transmission of
information, such as between processor units, during memory accesses, etc.
The arithmetic codes can also detect errors in such information transfers,
but more importantly, they can check the correctness of arithmetic
operations. The obvious question is then, "why don't we always use
arithmetic codes instead of transmission codes?" The answer is twofold:
(1) the arithmetic codes often require more check bits than the transmission
codes, and (2) the arithmetic codes are not nearly as well known as the
transmission codes, e.g., the use of parity bit(s).

2.1  PARITY  BITS 

The work-horse of the error checking world is the parity bit. The use
of this one extra bit to detect the presence of any single bit position
error has been adopted to such an extent that it is unusual when it is not
used in information transfers. The most common type of parity check is
based on the selection of a 1 or 0 for the parity bit such that the total
number of 1s in the word is odd. The choice of an even number of 1s would
work equally well, of course. Indeed, the use of even parity is consistent
with Garner's generalized theory of parity checking in which the parity
digit is the modulo b sum of the digits of the number, where b is the base
of the number system [9].

The parity bit will allow the detection of any single bit error (or
any odd number of errors) but will not detect any combination of two errors
(or any other even number). Let us consider what further detection, or
perhaps even correction, capability we could have at the cost of additional
parity bits. Suppose that, instead of selecting the parity bit for an odd
number of 1s, we use multiple parity bits such that there is always some
multiple of three 1s in each word. Such a method would require either two or
three parity bits, depending on whether the all 0s word is to be eliminated.
Even this rather trivial example illustrates the close ties between the
machine design and the redundancy techniques to be used. The use of such
multiple parity bits would reduce the number of undetected error conditions,
although it would not catch all combinations of two errors.
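The two schemes just described can be tried out with a short sketch. It is
only an illustration under stated assumptions: the function names, the
8-bit data word and the placement of the check bits at the end of the word
are all hypothetical choices, not taken from any particular machine.

```python
def odd_parity_bit(data):
    """Check bit that makes the total number of 1s (data + check) odd."""
    return 0 if sum(data) % 2 == 1 else 1

def odd_parity_ok(word):
    """True when the received word still holds an odd number of 1s."""
    return sum(word) % 2 == 1

def mod3_check_bits(data):
    """Two check bits holding just enough 1s to make the total a multiple of 3."""
    extra = (-sum(data)) % 3            # 0, 1 or 2 additional 1s are needed
    return [[0, 0], [0, 1], [1, 1]][extra]

def mod3_ok(word):
    """True when the number of 1s in the word is a multiple of three."""
    return sum(word) % 3 == 0

def flip(word, *positions):
    """Return a copy of word with the given bit positions complemented."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out

data = [1, 0, 1, 1, 0, 0, 1, 0]
word = data + [odd_parity_bit(data)]
print(odd_parity_ok(word))                # intact word passes
print(odd_parity_ok(flip(word, 3)))       # one error: check fails
print(odd_parity_ok(flip(word, 3, 5)))    # two errors: check passes, error missed

word3 = data + mod3_check_bits(data)
print(mod3_ok(flip(word3, 1, 4)))         # two 0 -> 1 errors: check fails
print(mod3_ok(flip(word3, 0, 1)))         # a 1 -> 0 plus a 0 -> 1: error missed
```

The last two lines show why the multiple-of-three scheme catches some, but
not all, double errors: two errors in the same direction change the count of
1s by two, while a compensating pair leaves the count unchanged.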

Another extension of the use of multiple parity bits is to use a matrix
data configuration with parity checks on both the rows and the columns. Any
single bit error will result in one row parity failure and one column parity
failure. These two parity failures will represent the "coordinates" of the
incorrect bit location, and therefore will allow the correction of such a
single error. Checks of this type are often made in magnetic tape data
blocks, and are shown in Figure 1 for 5 rows and 6 columns.

Figure 1: Even parity checks for matrix data.

If even parity is used for the row and column checks, the corner parity
bit is useful as a "check on the checks," i.e., the corner bit should be
the proper parity bit for both the row parity bits and the column parity bits.
The use of odd parity bits does not, in general, result in a correct corner
check bit. The use of even parity is successful because it essentially forms
the modulo two sum of all the bits in the block. This sum is the same,
regardless of whether the sum is taken over the rows or over the columns.
The above does bring out a subtle difference between the use of even and odd
parity. The two are often considered to be (and usually are) quite
interchangeable.
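A minimal sketch of the matrix check follows. The 4 x 5 data block, the
function names and the choice to append the check bits as a last column and
a last row are assumptions made for illustration; the layout of Figure 1
itself is not reproduced here.

```python
def parity_bit(bits, odd=False):
    """Check bit giving the group an even (default) or odd number of 1s."""
    return (sum(bits) + (1 if odd else 0)) % 2

def encode_block(rows):
    """Append an even-parity bit to each row, then a parity row over all columns."""
    coded = [list(row) + [parity_bit(row)] for row in rows]
    coded.append([parity_bit(col) for col in zip(*coded)])
    return coded

def failing(lines):
    """Indices of the rows (or columns) whose even-parity check fails."""
    return [i for i, line in enumerate(lines) if sum(line) % 2]

def correct_single_error(block):
    """Use the failing row and column as coordinates and complement that bit."""
    bad_rows, bad_cols = failing(block), failing(zip(*block))
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        block[bad_rows[0]][bad_cols[0]] ^= 1
        return bad_rows[0], bad_cols[0]
    return None

def corner_bits(rows, odd=False):
    """Corner bit computed two ways: over the row checks and over the column checks."""
    row_checks = [parity_bit(r, odd) for r in rows]
    col_checks = [parity_bit(c, odd) for c in zip(*rows)]
    return parity_bit(row_checks, odd), parity_bit(col_checks, odd)

data = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 0, 1]]

block = encode_block(data)
block[2][3] ^= 1                          # inject a single bit error
print(correct_single_error(block))        # coordinates of the repaired bit
print(block == encode_block(data))        # block is restored
print(corner_bits(data))                  # even parity: the two corner values agree
print(corner_bits(data, odd=True))        # odd parity: they disagree for this block
```

With odd parity the two corner values agree only when the number of rows and
the number of columns are both even or both odd, which is why the text says
that odd parity does not "in general" give a correct corner bit.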


2.2  HAMMING  CODES 


Since three bits can define eight unique states, one might consider
attaching three parity bits to a group of eight data bits with the idea of
detecting the location of an error. Further investigation reveals that one of
the eight states is required for the "no error" condition, and hence only
seven data bits could be used. One possible assignment of the parity checks
would be as shown in Figure 2.

Figure 2: One possible parity check arrangement


One problem with this parity check scheme is the lack of any check on
the check bits. Since a sizeable fraction of the bits are used as parity
checks, there is a high probability that some of those bits would eventually
cause errors rather than merely correct other errors. However, the scheme
can detect the error location for any of the seven data bits, since each
location is checked by a unique combination of parity checks. For example,
if parity checks #1 and #3 indicate an error but #2 does not, the error must
be in bit location 5.

The Hamming code utilizes a similar parity checking scheme, except that
the parity bits are included in the group of seven protected bits. This
leaves four bits for data, and the code is often called a (7,4) code,
indicating the total number of bits and the number of data bits. The Hamming
(7,4) code is shown in Figure 3.

Figure 3 also shows the Hamming (7,4) code representation for the
decimal numbers from zero to nine and demonstrates error correction by means
of an example in which a ONE is lost in the 6th bit location. The parity
error pattern (110) points to the location in error and hence correction
can be made by simply complementing that bit. We will find later that
correction in arithmetic processes is not as simple as merely complementing
one bit since errors may propagate due to carries or borrows. However, for
the transmission of data, the bits are considered to be independent.

The notions of "distance" and "weight" are also shown in Figure 3. Since
the terms are not uniquely defined, we should call these the Hamming weight
and the Hamming distance, to distinguish between the Hamming and arithmetic
distance and weight. The Hamming weight is the number of non-zero digits
appearing in the code symbol, and the Hamming distance is the number of digit
positions in which two code symbols differ. For error detection, the
Hamming distance must be at least two, while for correction of a single error
or detection of two errors, the distance must be at least three. In general,
the distance between any two allowable code symbols must be at least 2n+1 to
correct n errors or to detect 2n errors. Richards [16] states that error
detection can be traded for error correction since one cannot generally have
the full amount of each. This trade-off can be seen by a simple example of
a distance three code. We could detect up to two errors by such a code, but
we could not distinguish between one error and two errors. Hence, we would
run the risk of falsely "correcting" a bit position if we tried to correct it
when two errors had actually occurred. We must either assume single errors
and correct any error, or do no error correction but detect any pattern of
one or two bits in error. For the simultaneous correction of t or fewer
errors and the detection of d or fewer errors (d >= t), one needs a distance
of at least t+d+1 [4].

Figure 3. The Hamming (7,4) code and examples of its use.

The parity matrix of Figure 3b indicates by a 1 the bits which are
involved in each parity check. Note that no distinction is made as to which
bit is the actual parity bit. The arrangement of 1s in the parity matrix
is such that the pattern of parity errors "points" to the location which is
in error. Correction is by merely complementing that particular bit.
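The "pointing" behavior can be sketched as follows. Since Figure 3 is not
reproduced here, the sketch assumes the classic Hamming arrangement in which
the parity bits occupy bit locations 1, 2 and 4 and parity check p covers
every location whose binary index contains p; the function names and the
choice of even parity are likewise assumptions.

```python
def hamming74_encode(d):
    """d = [d1, d2, d3, d4] -> 7-bit word; list index 0 holds bit location 1."""
    word = [0] * 8                       # word[0] is unused so indices match locations
    word[3], word[5], word[6], word[7] = d
    for p in (1, 2, 4):                  # parity bits sit at locations 1, 2 and 4
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    return word[1:]

def hamming74_syndrome(word):
    """Sum of the failing check numbers: the location in error (0 = no error)."""
    w = [0] + list(word)
    return sum(p for p in (1, 2, 4)
               if sum(w[i] for i in range(1, 8) if i & p) % 2)

def hamming74_correct(word):
    """Complement the bit the syndrome points to; return (word, location)."""
    location = hamming74_syndrome(word)
    fixed = list(word)
    if location:
        fixed[location - 1] ^= 1
    return fixed, location

sent = hamming74_encode([0, 1, 1, 0])
received = list(sent)
received[5] = 0                          # a ONE is lost in the 6th bit location
fixed, location = hamming74_correct(received)
print(format(location, "03b"))           # parity error pattern
print(fixed == sent)
```

Under these assumptions the failing checks are #2 and #4, so the error
pattern read as a binary number is 110, which points to location 6.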

Larger Hamming coded words can be built up by the use of k check bits
and n = 2^k - 1 total bits. Some of these values are listed in Table 1. Note
the advantage of using long words rather than several shorter words, e.g.,
by coding bytes separately.


Code        Total No.      No. of Check    No. of Information
            of Bits, n     Bits, k         Bits, (n-k)

(7,4)           7              3                4
(15,11)        15              4               11
(31,26)        31              5               26
(63,57)        63              6               57

Table 1. Various Hamming Codes
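The entries of Table 1, and the advantage of long words, follow directly
from n = 2^k - 1. The sketch below assumes that a data word shorter than
n - k bits is handled by a shortened code that simply leaves the unused
information positions empty; the helper name and the 64-bit example are
hypothetical.

```python
def check_bits_needed(info_bits):
    """Smallest k for which a 2**k - 1 bit Hamming word holds info_bits data bits."""
    k = 2
    while 2**k - 1 - k < info_bits:
        k += 1
    return k

for k in (3, 4, 5, 6):
    n = 2**k - 1
    print(f"({n},{n - k}): {k} check bits, {k / n:.1%} of the word")

one_long_word = check_bits_needed(64)      # 64 data bits coded as a single word
eight_bytes = 8 * check_bits_needed(8)     # the same data coded byte by byte
print(one_long_word, eight_bytes)
```

Coding 64 data bits as one word needs far fewer check bits than coding the
same data as eight separate bytes, at the price of correcting only one error
in the whole word rather than one per byte.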

2.3  CYCLIC  CODES 

The arrangement of the 1,0 pattern in the Hamming code parity matrix
was chosen for ease of locating the defective bit position. Another
arrangement of interest is that of the (7,4) cyclic code, namely:

        1 0 1 1 1 0 0
    P = 0 1 0 1 1 1 0
        0 0 1 0 1 1 1

The check digits are chosen to be the leftmost three bits and, as can be
seen by the parity matrix, they check the three bits which occur after an
intermediate location. A particularly simple encoding scheme for this
arrangement is shown in [1].

Cyclic codes are based on the algebra of polynomials. For example, the
(7,4) cyclic code is developed from a 3rd degree generating polynomial,

    g(x) = 1 + x^2 + x^3

which is a factor of 1 - x^7 (we use x^7 since n = 7), with all coefficients
taken modulo 2. The preceding parity matrix can be found as follows [4]:

    h(x) = (1 - x^7)/g(x) = 1 + x^2 + x^3 + x^4

       h(x) = 1011100
      xh(x) = 0101110
    x^2h(x) = 0010111

The H matrix is formed from h(x), xh(x) and x^2h(x) with the order of the
elements reversed:

        0 0 1 1 1 0 1
    H = 0 1 1 1 0 1 0
        1 1 1 0 1 0 0

Our P matrix is the same as H except for the reversal of the elements,
since we wanted the parity bits on the left to be consistent with the
previous parity and Hamming code usage.

The use of such generating polynomials has been extended to a number
of such codes which can detect and correct either burst or random errors.
Burst errors are errors which affect several adjacent bit positions and
hence are of definite practical concern. Such codes are of special interest
since considerably less redundancy is required for the correction of a given
number of errors if such errors are confined to a burst [1,4,8].

Any (n,m) cyclic code can detect a burst of length n-m or less, which is
a considerably larger capability than when the errors are randomly dispersed.
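The polynomial relations above can be checked mechanically. The following
sketch assumes coefficient lists written lowest power first (so h(x) reads
1011100 as in the text) and takes the parity rows to be h(x), xh(x) and
x^2h(x) with the check bits in the three leftmost positions; the function
names are illustrative only.

```python
def gf2_mul(a, b):
    """Multiply two polynomials with modulo 2 coefficients (lowest power first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= ai & bj
    return out

g = [1, 0, 1, 1]                 # g(x) = 1 + x^2 + x^3
h = [1, 0, 1, 1, 1]              # h(x) = 1 + x^2 + x^3 + x^4

def shifted(poly, shift, n=7):
    """Coefficients of x**shift * poly(x), padded to n positions."""
    row = [0] * n
    for i, c in enumerate(poly):
        row[i + shift] = c
    return row

P = [shifted(h, s) for s in range(3)]    # rows h(x), xh(x), x^2h(x)

def encode(data):
    """data = the four rightmost bits; returns the 7-bit word, check bits leftmost."""
    word = [0, 0, 0] + list(data)
    for r in (2, 1, 0):                  # each check bit depends only on bits to its right
        word[r] = sum(word[j] for j in range(r + 1, 7) if P[r][j]) % 2
    return word

def syndrome(word):
    """One parity result per row of P; all zeros for a valid code word."""
    return [sum(p * w for p, w in zip(row, word)) % 2 for row in P]

print(gf2_mul(g, h))                     # coefficients of 1 + x^7
word = encode([1, 0, 0, 0])
print(word, syndrome(word))
rotated = word[-1:] + word[:-1]          # a cyclic shift of a code word
print(rotated, syndrome(rotated))
```

A cyclic shift of any code word again satisfies all three checks, which is
the property that gives the code its name.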
2.4 CODES FOR ASYMMETRIC ERRORS


Some types of logic devices may have definite failure modes which allow
one to assume that all errors are 1s becoming 0s (or vice versa for other
types of devices). This limitation might allow one to develop more efficient
error detection and correction techniques.

One such technique is the Berger code, which uses an extra k bits to
represent the number of 0s in the data word (or the number of 1s). The
number of bits in the check symbol is:

    k = 1 + [log2 m]        m = no. of data digits

where [ ] denotes the integer part. The above code does not correct errors,
but can detect all combinations of errors with the limitation that only 1s
becoming 0s are possible (or vice versa).
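A minimal sketch of such a check symbol follows, assuming the count of 0s
is stored as an ordinary binary number appended to the right of the data
word; the function names and the 8-bit example are hypothetical.

```python
def berger_encode(data):
    """Append the number of 0s in the data word as a k-bit binary check symbol."""
    m = len(data)
    k = m.bit_length()                      # equals 1 + integer part of log2(m)
    zeros = m - sum(data)
    check = [(zeros >> i) & 1 for i in reversed(range(k))]
    return list(data) + check

def berger_ok(word, m):
    """True if the check symbol still equals the number of 0s in the m data bits."""
    data, check = word[:m], word[m:]
    zeros = m - sum(data)
    return zeros == int("".join(map(str, check)), 2)

data = [1, 0, 1, 1, 0, 1, 0, 1]
word = berger_encode(data)
print(word[8:])                             # check symbol for three 0s

damaged = list(word)
damaged[0] = damaged[2] = 0                 # two 1s become 0s in the data
print(berger_ok(word, 8), berger_ok(damaged, 8))
```

The reason every 1-to-0 pattern is caught is that such errors can only raise
the count of 0s in the data part and can only lower the value held in the
check part, so the two can never be brought back into agreement.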

2.5  FIXED  WEIGHT  CODES 

The Hamming weight of a binary code symbol is its number of 1s. A
fixed weight code has the same number of 1s in each code symbol. One such
example is the "two out of five" code, in which two of the five bit positions
are 1s, and the other three are 0s, such as 01010.
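The two-out-of-five code can be enumerated in a few lines. The sketch below
makes no assumption about which code symbol stands for which decimal digit
(several assignments are in use); it only lists the symbols and applies the
weight check.

```python
from itertools import combinations

# every 5-bit symbol with exactly two 1s
CODEWORDS = sorted(
    "".join("1" if i in ones else "0" for i in range(5))
    for ones in combinations(range(5), 2)
)

def valid(word):
    """A received 5-bit string is accepted only if exactly two bits are 1."""
    return len(word) == 5 and word.count("1") == 2

print(len(CODEWORDS), CODEWORDS)
print(valid("01010"))      # a code symbol
print(valid("01000"))      # one 1 lost: weight is wrong, error detected
print(valid("00110"))      # a 1 -> 0 and a 0 -> 1 together: accepted, error missed
```

There are exactly ten such symbols, enough for the ten decimal digits. Any
single error, and any set of errors all in the same direction, changes the
weight and is detected; only compensating pairs slip through.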


3.0  ARITHMETIC  CODES 

Those coding schemes which have been discussed in Section 2.0 are
primarily for error control during the transmission of information. When one
has to perform arithmetic operations on the information, those code bits do
not perform any useful check and may, in some cases, make the arithmetic
operations more difficult. For example, if the information is coded in the
two-out-of-five code, a normal binary arithmetic unit could not be used
without requiring a conversion to binary code. Some other codes, such as the
use of a single parity bit or a Hamming code, allow the separation of the
check bits and the information bits. However, none of these codes provide a
check on the arithmetic process itself.


Some codes will now be considered which provide protection during
information transmission and also provide a check on the arithmetic process.
As one would expect, this additional check will cost us something. In the
cases of interest this cost is both additional check bits and a somewhat
longer check method.

3.1  RESIDUE  OPERATIONS 

Before  considering  how  residues  can  be  used  for  checking  in  arithmetic 
operations,  let  us  review  a  few  properties  of  residue  arithmetic. 

The residue of an integer number x modulo m is the remainder of the
division x/m. As examples:

    26/3 = 8 with a remainder 2;    26 modulo 3 = 2
     5/3 = 1 with a remainder 2;     5 modulo 3 = 2
    17/8 = 2 with a remainder 1;    17 modulo 8 = 1

or in general:

    x modulo m = r    where:  x = q·m + r,
                              x & q are integers,
                              m is a positive integer and 0 <= r < m

The residue can be indicated by various notations including

    x modulo m
    x mod m
    |x|_m

The latter notation will be used in this report to indicate the residue of
x modulo m.

Several residue operations are of interest in error detection and
correction, including the following identities (see [13] for proofs).

1. Residue of multiples of m:    |km|_m = 0 for k = integer
   E.g.*: 15 modulo 5 = 0

2. Addition of multiples of m:    |x + km|_m = |x|_m
   E.g.: (9 + 15) modulo 5 = 9 modulo 5 = 4

3. Addition and subtraction:    |x ± y|_m = | |x|_m ± |y|_m |_m
   E.g.: (9 ± 7) modulo 5 = [(9 modulo 5) ± (7 modulo 5)] modulo 5
         = (4 ± 2) modulo 5 = 1 or 2

4. Multiplication:    |x·y|_m = | |x|_m · |y|_m |_m
   E.g.: (9·7) modulo 5 = [(9 modulo 5)·(7 modulo 5)] modulo 5
         = [8] modulo 5 = 3

Another property of interest is the "casting out 9s" for decimal
operations, or in general, casting out (b^n - 1)s for the number base b, and
n an integer (except for b = 2 and n = 1). For example, the residue modulo 9
of a decimal number can be found by "casting out 9s" from the digits of the
number, or stated otherwise, adding the digits in modulo 9 arithmetic.

    |19672|_9 = 7    (casting out 9s)

    |19672|_9 = |1+9+6+7+2|_9 = 7.
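The identities and the casting-out rule can be verified numerically. The
sketch below uses the values m = 5, k = 3, x = 9 and y = 7 from the examples;
the helper names residue and cast_out are hypothetical, and cast_out is one
straightforward way of summing digit groups, not a description of any
particular hardware.

```python
def residue(x, m):
    """|x|_m: the non-negative remainder of x divided by m."""
    return x % m          # Python's % already yields a value in 0..m-1 for m > 0

m, k, x, y = 5, 3, 9, 7
assert residue(k * m, m) == 0
assert residue(x + k * m, m) == residue(x, m) == 4
assert residue(x + y, m) == residue(residue(x, m) + residue(y, m), m) == 1
assert residue(x - y, m) == residue(residue(x, m) - residue(y, m), m) == 2
assert residue(x * y, m) == residue(residue(x, m) * residue(y, m), m) == 3

def cast_out(number, base=10, n=1):
    """Residue modulo base**n - 1, found by repeatedly summing n-digit groups."""
    modulus = base**n - 1
    group = base**n
    while number >= group:
        total = 0
        while number:
            number, digits = divmod(number, group)
            total += digits
        number = total
    return 0 if number == modulus else number

print(cast_out(19672))               # casting out 9s
print(cast_out(0b10110111, 2, 4))    # casting out 15s from a binary number
```

The same routine covers the binary cases used later in this section: with
base 2 and n = 2, 3 or 4 it casts out 3s, 7s or 15s respectively.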

3.2  THE  USE  OF  RESIDUES  FOR  ARITHMETIC  ERROR  DETECTION 

Just as the parity bit is concatenated, i.e., attached to one end of a
data word, the residue could also be used as such an attached error check.
For example, the residue for 1967 modulo 9 is 5, and this number could be
written in error coded form as 5 1967. Suppose that we want to add (or
subtract) two such numbers, as shown below.

             2135                  2 2135
             1967                  5 1967
     sum  =  4102        check     7 4102
     diff =  0168        check**   6 0168

* For examples, consider m = 5, k = 3, x = 9 and y = 7.

** Note that the residue can never be negative. In this example a 9 was
added to the residue to make the final residue positive.

The check on the addition is made by comparing the residue of the sum
with the sum (modulo 9) of the residues of the two operands. If the
comparison is not "true," i.e., not the same values, then an error has
occurred. If the comparison is "true," the computation is assumed to be
correct. However, certain errors will not be detected when their net effect
adds up to some multiple of the modulus, e.g., in the above example, if the
sum were 4192 rather than 4102. The larger the modulus, the less the
probability of such missed errors. Paal [11] utilized a combination of
modulo 31 error coding and a repetition of the algorithm to ensure that large
magnitude errors were not passed over due to this undetectable, multiple of
the modulus, type of error. In the second execution of the algorithm, only
10 bit accuracy was utilized, which was adequate to ensure that the magnitude
of any "missed errors" would be less than 0.1%. The use of the modulo 31
residue required that 5 check bits be used on each word, which in itself
would detect some 97% of all error patterns, i.e., the fraction 30/31 of all
errors.
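The addition check, including the kind of error it misses, can be sketched
as follows. The pair representation (residue, value) and the function names
are assumptions made for illustration.

```python
M = 9                                   # check modulus used in the example above

def encode(x, m=M):
    """Return the coded pair (residue, value), e.g. 1967 -> (5, 1967)."""
    return (x % m, x)

def check_add(a, b, reported_sum, m=M):
    """True if the residue of the reported sum equals the sum of the residues."""
    return reported_sum % m == (a[0] + b[0]) % m

def check_sub(a, b, reported_diff, m=M):
    """Same check for subtraction; % keeps the residue non-negative."""
    return reported_diff % m == (a[0] - b[0]) % m

a, b = encode(2135), encode(1967)
print(check_add(a, b, 4102))    # correct sum: check passes
print(check_add(a, b, 4112))    # wrong by 10: check fails
print(check_add(a, b, 4192))    # wrong by 90, a multiple of 9: check passes
print(check_sub(a, b, 168))     # correct difference, residue (2 - 5) modulo 9 = 6
```

Only errors whose magnitude is a multiple of the modulus escape, which is
the fraction 1/9 of all error magnitudes here and 1/31 for the modulo 31
coding mentioned above.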

The STAR computer [17] utilizes a residue code for instruction words in
which a four bit check symbol is attached to each 28 bit address word. The
check symbol is (15 - modulo 15 residue), i.e., the 1s complement of the
residue, so that the sum of the check residue symbol and the instruction
residue should be zero (represented as 1111 due to the use of the 1s
complement). This approach eliminates the need for a separate comparison
operation between the two residues. The residue calculation is by "casting
out 15s" through the use of a four bit adder.
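The idea of a complemented check symbol can be sketched in software. This
is not the STAR logic itself: the four bit adder is modeled as an adder with
end-around carry, which is an assumption, as are the function names and the
sample word.

```python
def residue15(word, width=28):
    """Modulo 15 residue by adding 4-bit groups with end-around carry."""
    acc = 0
    for shift in range(0, width, 4):
        acc += (word >> shift) & 0xF
        acc = (acc & 0xF) + (acc >> 4)      # end-around carry keeps acc within 4 bits
    return acc                              # 0000 and 1111 both stand for residue zero

def check_symbol(word):
    """1s complement of the residue, i.e. 15 minus the residue."""
    return ~residue15(word) & 0xF

def word_ok(word, check):
    """Add the check symbol to the residue; a valid word gives 1111."""
    total = residue15(word) + check
    total = (total & 0xF) + (total >> 4)
    return total == 0xF

word = 0x2F3A91C                            # an arbitrary 28-bit value
check = check_symbol(word)
print(format(check, "04b"))
print(word_ok(word, check))                 # valid word
print(word_ok(word ^ (1 << 9), check))      # one bit of the word complemented
```

Because the check symbol is the complement of the residue, a single addition
yields 1111 for a good word and no separate comparison step is needed.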

The residue check method of error detection can also be applied to
multiplication by use of the identity

    |x·y|_m = | |x|_m · |y|_m |_m

An example of the checking of a multiplication is:
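As an illustration of the identity (with operands chosen here, reusing the
two numbers of the addition example and the modulus 9), a multiplication
check might be sketched as:

```python
def product_check(x, y, product, m=9):
    """Compare the residue of the product with the product of the operand residues."""
    return product % m == ((x % m) * (y % m)) % m

x, y = 2135, 1967                 # residues modulo 9 are 2 and 5
good = x * y
print(good, good % 9)             # product and its residue
print(product_check(x, y, good))          # correct product
print(product_check(x, y, good + 1))      # error of 1: detected
print(product_check(x, y, good + 18))     # error of 18, a multiple of 9: missed
```

The product of the residues is 2·5 = 10, whose residue modulo 9 is 1, and
this must agree with the residue of the full product.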
3.3 THE USE OF RESIDUES FOR ERROR CORRECTION


In an n bit binary word there are 2n possible error "states," since
each bit position can be in error by either +1 or -1. Since one additional
state is required for the "no-error" condition, a total of 2n+1 states are
required to be able to correct an error in the n bit word. This requirement
puts a constraint on the lower bound of the modulus, namely the modulus must
now be at least 2n+1 to have 2n+1 unique residue values (including zero).

As an example, consider a 5 bit adder for which n = 5, and m = 2n+1 = 11.

[Worked addition example: operands not recoverable.]
Since 8 ≠ 1, an error has occurred.

        Correction Code Table

    Residue     Correction

       0        no error
       1        -2^0
       2        -2^1
       3        +2^3
       4        -2^2
       5        -2^4
       6        +2^4
       7        +2^2
       8        -2^3
       9        +2^1
      10        +2^0

The error correction table used in the previous example can be built up
by assuming certain errors and seeing what the resulting residue is. For
example, if during the addition of zero plus zero, a 1 occurs in the 2^0
column, the residue will be 1, and the correction should be to subtract 2^0.
Hence -2^0 occurs opposite the residue 1. The rest of the table can be
built up in a similar manner.


One  can  see  that  as  the  number  of  bits  in  a  data  word  gets  longer,  the 
number  of  distinct  residues  must  also  increase  and  hence,  the  modulus  must 
become  larger. 
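The table-building procedure just described lends itself to a short sketch.
It assumes that the table is indexed by the discrepancy between the residue
of the result and the residue predicted from the operands; the function
names and the operands 9 and 5 are chosen here for illustration.

```python
def correction_table(n_bits, modulus):
    """Map each residue to the correction for a single error of +2**i or -2**i."""
    table = {0: 0}                               # residue 0: no error
    for i in range(n_bits):
        for error in (+2**i, -2**i):
            r = error % modulus
            if r in table:
                raise ValueError("residues are not unique for this modulus")
            table[r] = -error                    # the correction undoes the error
    return table

TABLE = correction_table(5, 11)

def correct(result, expected_residue, modulus=11):
    """Look up the residue discrepancy and apply the corresponding correction."""
    discrepancy = (result - expected_residue) % modulus
    return result + TABLE[discrepancy]

for r in sorted(TABLE):
    print(r, TABLE[r])

expected = (9 % 11 + 5 % 11) % 11                # residue predicted for 9 + 5
print(correct(6, expected))                      # the 2^3 bit was lost from 01110
```

A modulus of at least 2n+1 is necessary but not always sufficient: for a
4 bit word the modulus 9 fails, since +2^3 and -2^0 both leave the residue 8,
and the function above rejects it.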

3.4  THE  AN  PRODUCT  CODES 

The basic notion of a product code for error detection is quite simple,
namely, that if we multiply both integer operands in an addition (or
subtraction) operation by some other integer A, then the sum (or difference)
should also be an integral multiple of A. That is:

    AX + AY = A(X + Y)

We can check for this "integral multiple" by a repeated division by A until
the remainder is either zero or at least less than A. The remainder should
be zero (if an integral multiple). We say that an error has occurred if it
happens to not be equal to zero; otherwise we assume that the answer is
correct even though some undetectable error may actually have occurred. We
will see a little later that in some cases we can use the remainder value to
determine which bit position was actually in error, and hence can have error
correction as well as detection. The similarities between the AN product
codes and the previously discussed residue codes can readily be seen in the
above comments, and will be demonstrated again in subsequent paragraphs.

Several practical considerations influence the choice of the multiplier
A, and the way in which the remainder is found. Division is a time consuming
computer operation and some alternative method would be of great value
in determining the remainder. We will see that a good choice for A is

    A = b^n - 1        n = some integer,
                       b = number base

A's of this form allow the remainder search to be performed by a "casting
out" procedure such as the casting out 9s of decimal arithmetic checking.
In binary arithmetic checking, we can cast out 3s, 7s, 15s, etc.,
corresponding to a choice of A equal to 3, 7, 15, etc., respectively. As we
make A larger, we reduce the probability of accepting an erroneous answer as
correct, but at the cost of additional bits in the coded representation of
the numbers. These relationships are:

    Fraction of all errors which are detectable = (A-1)/A

    Extra bits for the redundancy ≈ log2 A.
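A small sketch of an AN code with A = 3 shows both the check and its limits.
The function names and the 8-bit register width used for the error
injection are assumptions made for illustration.

```python
A = 3                                   # multiplier of the form 2**n - 1 with n = 2

def an_encode(n):
    """Coded form of the integer n."""
    return A * n

def an_check(coded):
    """A valid coded number is an exact multiple of A."""
    return coded % A == 0

def an_decode(coded):
    """Recover n, refusing any value that is not a multiple of A."""
    if not an_check(coded):
        raise ValueError("error detected: result is not a multiple of A")
    return coded // A

total = an_encode(9) + an_encode(5)     # AX + AY = A(X + Y)
print(total, an_decode(total))

# every single bit error changes the value by +/- 2**i, never a multiple of 3
print(all(not an_check(total ^ (1 << i)) for i in range(8)))

# an error whose magnitude is a multiple of A goes unnoticed
print(an_check(total + 6), an_decode(total + 6))
```

With A = 3 the coded numbers need about log2 3, i.e. between one and two,
extra bits, and one error magnitude in three (the multiples of 3) is
accepted as correct, in line with the fraction (A-1)/A above.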


3.5  THE  AN+B  CODES 

The AN+B or augmented product codes are used when ease of complementing
is desired, e.g., when a 9s complement representation is desired by
merely complementing each bit position of the binary representation. One
example of such a code is the 3N+2 code. The complementing action is shown
below.

Example of 3N+2 code:

    Decimal 4 becomes 3(4)+2 = 14 -> 01110
    Decimal 5 becomes 3(5)+2 = 17 -> 10001
    Note that these two numbers have complementary 1 & 0
    positions and that 4 & 5 are 9s complements

The choice of the values of A and B is determined by two factors: (1) the
A must be selected, as for the regular AN codes, as a relatively prime
number with respect to the radix, and (2) the B must be some integer solution
of the equation for the 1s complement:

    [r^n - 1] - [AN + B] = A[b - 1 - N] + B

where

    n = highest exponent of r
    r = radix of the number system
    b = number of states that a digit can assume
    b-1-N = 9s complement of N (if b = 10)

Solving for B we obtain

    B = (1/2)[(r^n - 1) - A(b - 1)].

For A = 3, r = 2, b = 10, and n = 5,

    B = (1/2)[(32 - 1) - 3(9)] = 2.

Hence 3N+2 is a valid solution of AN+B for these values since 2 is an
integer solution for B. A 3N+2 coded BCD is shown in Table 2.


    N    3N+2 BCD        N    3N+2 BCD

    0     00010          5     10001
    1     00101          6     10100
    2     01000          7     10111
    3     01011          8     11010
    4     01110          9     11101

Table 2. 3N+2 BCD Code

The possible occurrence of some value of A for which one could have a
self-complementing code with B = 0 compels one to try to solve the above
equation towards this goal.

    [r^n - 1] - A(b - 1) = 0,    i.e.,    r^n - 1 = A(b - 1).

For r = 2 and b = 10,    2^n - 1 = 9A.

For the five bit word used above (n = 5), 2^n - 1 = 31 is not a multiple of
9, so there is no integer solution for A; the shortest word for which one
exists is n = 6, with A = 7. Hence, we must use an AN+B type code if we are
to have the ease of complementing feature in a five bit code.

The added difficulty due to the plus B term is not only the nuisance
of having to add it at each encoding, but also the need to correct
each addition and subtraction, which otherwise end up of the form AN+2B and
AN+0B respectively. Multiplication and division are also made more
difficult.
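The 3N+2 code, its self-complementing property and the B corrections for
addition and subtraction can be sketched as follows; the function names are
hypothetical and the code words are held as ordinary integers of 5 bits.

```python
A, B, BITS = 3, 2, 5

def encode(n):
    """Decimal digit n -> 3N+2 code word."""
    return A * n + B

def decode(coded):
    """Recover N, refusing any word that is not of the form 3N+2."""
    if (coded - B) % A:
        raise ValueError("error detected: word is not of the form AN+B")
    return (coded - B) // A

def complement(coded):
    """1s complement of the 5-bit code word."""
    return ~coded & (2**BITS - 1)

def add(c1, c2):
    """The raw sum is A(N1+N2) + 2B, so one B is removed."""
    return c1 + c2 - B

def sub(c1, c2):
    """The raw difference is A(N1-N2) + 0B, so B is restored."""
    return c1 - c2 + B

for n in range(10):
    print(n, format(encode(n), "05b"), decode(complement(encode(n))))

print(decode(add(encode(4), encode(5))))
print(decode(sub(encode(7), encode(3))))
```

Complementing every bit of the code word for N gives the code word for
9 - N, which is the ease-of-complementing feature the B term buys.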
3.6 SUM CODES


The sum code utilizes a separable check code with a check modulus.
This check code has k bits such that

    check code = |(-x) r^k|_a

where x = number being coded, r = radix (2 for binary), k = number of check
bits, and a = check modulus.

The sum code is placed at the right (least significant digit) part of
the coded number. The original data and the check symbol are processed
separately and a checking algorithm is used to check for proper residue
values after the computation.


4.0  OTHER  CONSIDERATIONS  AND  CONCLUSIONS 
4.1 A SUMMARY OF CODE TERMINOLOGY

In  this  section  we  have  discussed  codes  in  which  the  check  bits  were 
quite  distinguishable  from  the  information  bits  (called  separate  codes)  and 
also  codes,  such  as  the  AN  codes,  in  which  the  bits  are  nonseparable. 

The  parity  bits  of  various  codes  are  examples  of  "systematic"  codes 
since  there  is  a  functional  relationship  between  the  check  bits  and  the 
information  bits.  When  such  a  relationship  does  not  exist,  the  code  is  said 
to be non-systematic.

Errors  can  either  be  randomly  dispersed  throughout  the  word,  or  they 
may  be  neighboring  bits.  The  latter  are  called  "burst"  errors  and  are  an 
important  practical  consideration,  since  errors  can  often  occur  as  bursts 
due  to  an  interference  or  transient  fault  which  affects  a  string  of  bits. 

The weight of a code symbol is the number of non-zero digits in the
symbol. For the "arithmetic weight" the number must be represented in
minimal form, i.e., using a minimum number of non-zero digits, each of which
may be +1 or -1. For example, 01111 (= 15) becomes 1000(-1) (= 16 - 1) in
minimal form, giving an arithmetic weight of two.

The Hamming distance differs from the arithmetic distance since the
Hamming distance is the (Hamming) weight of the modulo 2 sum of the two
numbers, while the arithmetic distance is the (arithmetic) weight of the
difference of the two numbers.
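The two kinds of weight and distance can be compared with a short sketch.
The minimal form is computed here by the usual non-adjacent signed-digit
recoding, which is one standard way of obtaining a minimum number of
non-zero digits; the function names are hypothetical.

```python
def hamming_weight(x):
    """Number of 1s in the binary representation of x."""
    return bin(x).count("1")

def hamming_distance(a, b):
    """Weight of the bit-by-bit modulo 2 sum (exclusive OR) of a and b."""
    return hamming_weight(a ^ b)

def arithmetic_weight(x):
    """Number of non-zero digits in the minimal signed-digit form of |x|."""
    x = abs(x)
    weight = 0
    while x:
        if x & 1:
            # use digit +1 or -1, whichever leaves a value divisible by 4
            x -= 2 - (x & 3)
            weight += 1
        x >>= 1
    return weight

def arithmetic_distance(a, b):
    """Arithmetic weight of the difference of the two numbers."""
    return arithmetic_weight(a - b)

print(hamming_weight(0b01111), arithmetic_weight(0b01111))
print(hamming_distance(0b10000, 0b01111), arithmetic_distance(0b10000, 0b01111))
```

The second line illustrates why the arithmetic distance suits adders: 16 and
15 differ in all five bit positions, yet they are only a single error of
magnitude 1 apart, because a carry has rippled through the word.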


4.2  FAULT  PROPAGATION 


There are several ways in which one fault can propagate throughout an
information word causing more than one damaged bit location. One simple
example is a defective carry in an addition which could propagate for some
distance. However, this error is not really a multiple error since one
correction, via the necessary borrows or carries, could restore the correct
result. In other applications, the damage pattern may not be corrected so
easily.

The two major ways in which a fault can propagate in a damaging manner are
in byte organized processors and in multiple "cycle" algorithms, such as for
multiplication and division, in parallel processors, or even for addition and
subtraction in serial machines. One method of checking (and perhaps
correction) would be to perform a check at the end of each byte or cycle in
the algorithm. Unfortunately, this might increase the overall time for such
operations (as multiplication and division) to some intolerable value. The
close tie between the processor design and the type of redundancy and
checking is shown by such difficulties. For a more complete description of
fault propagation and the use of "error magnitudes" the reader is referred to
Avizienis [2].


4.3  FURTHER  STUDIES 

Coding theory is a very rich and by far the most developed branch of
fault-tolerant computing. For a simplified, less mathematical treatment
the interested reader is referred to Lin [18]. For an encyclopedic treatment
the reader should consult Hamming [20]. A recent interesting work by a
pioneer of coding theory is Peterson [19], which interrelates coding and
information theory.